Clustering XML Documents Considering The Weight of Large Items in Clusters

클러스터의 주요항목 가중치 기반 XML 문서 클러스터링


Abstract

As XML becomes established as the standard language for data exchange on the Internet, large numbers of web documents are being produced in XML, and web documents have naturally become a main target of information retrieval. Accordingly, research has addressed the structure, integration, and retrieval of the rapidly growing body of XML documents. This paper proposes a method for clustering XML documents based on frequent structures, as a foundation for efficient query processing and retrieval. First, each XML document is represented as a tree, the tree is decomposed, and frequently occurring structures are extracted from the decomposed structures. Second, the frequent structures extracted from each XML document are treated as the items of a transaction, and clustering is performed using a large-item weight that takes into account both the creation of each cluster and the cohesion of the resulting clusters as a whole. Third, experiments comparing the proposed method with previous methods show its advantages.
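The three steps in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's decomposition, its frequent-structure miner, or its weight formula, so everything below is an assumption made for illustration: structures are root-to-leaf tag paths, a structure counts as frequent when it appears in at least `min_support` of the documents, and the large-item score is simply the fraction of a document's items that are large in a candidate cluster. The names `root_to_leaf_paths`, `frequent_structures`, `large_items`, `cluster_transactions`, `theta`, and `min_score` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def root_to_leaf_paths(xml_text):
    """Decompose one XML tree into its set of root-to-leaf tag paths (assumed decomposition)."""
    paths = set()

    def walk(node, prefix):
        path = prefix + (node.tag,)
        children = list(node)
        if not children:
            paths.add("/".join(path))
        for child in children:
            walk(child, path)

    walk(ET.fromstring(xml_text), ())
    return paths


def frequent_structures(doc_paths, min_support):
    """Keep paths that occur in at least a min_support fraction of the documents."""
    counts = Counter(p for paths in doc_paths for p in paths)
    threshold = min_support * len(doc_paths)
    return {p for p, c in counts.items() if c >= threshold}


def large_items(cluster, theta):
    """Items contained in at least a theta fraction of the cluster's transactions."""
    counts = Counter(item for t in cluster for item in t)
    return {i for i, c in counts.items() if c >= theta * len(cluster)}


def cluster_transactions(transactions, theta=0.5, min_score=0.5):
    """Greedy one-pass clustering; the score is a simplified stand-in, not the paper's weight."""
    clusters, labels = [], []
    for t in transactions:
        best, best_score = None, 0.0
        for idx, cluster in enumerate(clusters):
            large = large_items(cluster, theta)
            score = len(t & large) / len(t) if t else 0.0
            if score > best_score:
                best, best_score = idx, score
        if best is None or best_score < min_score:
            clusters.append([t])  # no cluster fits well enough: open a new one
            labels.append(len(clusters) - 1)
        else:
            clusters[best].append(t)
            labels.append(best)
    return labels


docs = [
    "<book><title/><author/></book>",
    "<book><title/><author/><year/></book>",
    "<movie><title/><director/></movie>",
    "<movie><title/><director/><cast/></movie>",
]
doc_paths = [root_to_leaf_paths(d) for d in docs]
frequent = frequent_structures(doc_paths, min_support=0.5)
transactions = [paths & frequent for paths in doc_paths]
labels = cluster_transactions(transactions)
print(labels)  # [0, 0, 1, 1]
```

In this toy run the two `book` documents and the two `movie` documents fall into separate clusters because they share no frequent paths. A lower `min_score` makes the sketch open fewer clusters, which loosely mirrors the trade-off between cluster creation and cluster cohesion that the abstract describes.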
