Frequently Occurred Information Extraction from a Collection of Labeled Trees

Paik, Ju-Ryon;Nam, Jung-Hyun;Ahn, Sung-Joon;Kim, Ung-Mo;

Journal of Internet Computing and Services (인터넷정보학회논문지)

Volume 10, Issue 5 / Pages 65-78 / 2009 / 1598-0170 (pISSN) / 2287-1136 (eISSN)

Korean Society for Internet Information (한국인터넷정보학회)

Extraction of Frequently Occurring Information from Labeled Tree Data

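Paik, Ju-Ryon (School of Information and Communication Engineering, Sungkyunkwan University);
Nam, Jung-Hyun (Division of Applied Computer Science, Konkuk University);
Ahn, Sung-Joon (School of Information and Communication Engineering, Sungkyunkwan University);
Kim, Ung-Mo (School of Information and Communication Engineering, Sungkyunkwan University)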

Published : 2009.10.30

Abstract

The most commonly adopted approach to finding valuable information in tree data is to extract frequently occurring subtree patterns. Because frequent tree pattern mining has a wide range of applications, such as XML mining, web usage mining, bioinformatics, and network multicast routing, many algorithms have recently been proposed to find such patterns. However, existing tree mining algorithms suffer from several serious pitfalls when applied to massive tree datasets. The major problems are (1) modeling data as a hierarchical tree structure, (2) the high computational cost of candidate maintenance, (3) repeated scans of the input dataset, and (4) high memory dependency. These problems arise because most of these algorithms are based on the well-known Apriori algorithm and rely on the anti-monotone property for candidate generation and frequency counting. To solve them, we adopt a pattern-growth approach rather than the Apriori approach, and we extract maximal frequent subtree patterns instead of all frequent subtree patterns. The proposed method not only removes the step of pruning infrequent subtrees but also eliminates candidate subtree generation entirely. Hence, it significantly improves the whole mining process.

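The most common way to extract useful information from tree data is to obtain the subtree patterns that occur frequently. Because frequent tree pattern mining is widely used in many areas, including XML mining, web usage mining, bioinformatics, and network multicast routing, many algorithms have been proposed to extract such patterns. However, most tree mining algorithms proposed so far have several serious problems, which become more severe when the input is a large collection of trees. The main problems are (1) modeling data as a hierarchical tree structure, (2) the high computational cost of maintaining candidates, (3) repeated scans of the input dataset, and (4) high memory dependency. These problems arise mainly because most existing algorithms are based on the Apriori approach and apply the anti-monotone principle to candidate generation and frequency counting. To solve them, the authors build on a pattern-growth approach instead of the Apriori approach and aim to extract maximal frequent subtrees rather than all frequent subtrees. As a result, the proposed method not only dispenses with the step of removing infrequent subtrees but also never generates candidate trees, which considerably improves the overall mining process.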
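
The difference between frequent and maximal frequent subtrees mentioned in the abstract can be illustrated with a small Python sketch. It is not the authors' algorithm: it enumerates patterns by brute force, which is exactly the kind of candidate generation the paper sets out to avoid, and it serves only to show what "maximal" means. The data model is an assumption made for this illustration: each tree is a set of (parent, child) label pairs with labels unique within a tree, and the names `is_connected`, `support`, `trees`, and `min_support` are hypothetical.

```python
from itertools import combinations


def is_connected(edges):
    """Return True if the (non-empty) edge set forms one connected component."""
    nodes = {n for edge in edges for n in edge}
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        cur = stack.pop()
        for parent, child in edges:
            if cur == parent and child not in seen:
                seen.add(child)
                stack.append(child)
            elif cur == child and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen == nodes


def support(pattern, trees):
    """Number of trees that contain every edge of the pattern."""
    return sum(1 for tree in trees if pattern <= tree)


# Toy collection of labeled trees (assumed data, not from the paper).
trees = [
    frozenset({("book", "title"), ("book", "author"), ("author", "name")}),
    frozenset({("book", "title"), ("book", "author"), ("book", "year")}),
    frozenset({("book", "title"), ("book", "author"), ("author", "name"),
               ("book", "price")}),
]
min_support = 2  # a pattern is frequent if at least 2 trees contain it

# Brute-force enumeration of every connected edge subset (illustration only).
all_edges = sorted(set().union(*trees))
frequent = [
    frozenset(combo)
    for k in range(1, len(all_edges) + 1)
    for combo in combinations(all_edges, k)
    if is_connected(combo) and support(frozenset(combo), trees) >= min_support
]

# A frequent pattern is maximal if no other frequent pattern strictly contains it.
maximal = [p for p in frequent if not any(p < q for q in frequent)]

print(len(frequent), len(maximal))  # 6 1
```

On this toy input, six subtrees are frequent but only one is maximal, and the other five are all contained in it. Reporting only maximal patterns therefore gives a much more compact result without losing which structures are frequent.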
