Document Clustering with Relational Graph Of Common Phrase and Suffix Tree Document Model

공통 Phrase의 관계 그래프와 Suffix Tree 문서 모델을 이용한 문서 군집화 기법

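  • 조윤호 (Division of Computer and Communication Engineering, College of Information and Communications, Korea University)
  • 이상근 (Division of Computer and Communication Engineering, College of Information and Communications, Korea University)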
  • Published : 2009.02.28


NSTC, a previous document clustering method, measures the similarity between a pair of documents using TF-IDF during web document clustering. In this paper, we propose a new similarity measure based on a relational graph of common phrases instead of TF-IDF. The method weights each common phrase using a relational graph that represents the relationships among the common phrases in the document collection. Experimental results indicate that the proposed method clusters document collections more effectively than NSTC.
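To make the baseline concrete, the following is a minimal sketch of a phrase-based TF-IDF similarity between two documents, the kind of measure the abstract attributes to NSTC. It is not the authors' code and it does not implement the proposed relational-graph weighting, which the abstract does not specify. The function names, the toy phrase lists, and the particular TF-IDF variant `(1 + log tf) * log(1 + N / df)` are assumptions made for illustration; in a suffix tree document model the phrases would come from the tree's nodes rather than from hand-written lists.

```python
import math
from collections import Counter


def phrase_tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (phrase -> weight) per document.

    docs is a list of documents, each given as a list of phrases.
    The weighting (1 + log tf) * log(1 + N / df) is an assumed variant.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for phrases in docs:
        doc_freq.update(set(phrases))  # count each phrase once per document

    vectors = []
    for phrases in docs:
        term_freq = Counter(phrases)
        vectors.append({
            phrase: (1 + math.log(count)) * math.log(1 + n_docs / doc_freq[phrase])
            for phrase, count in term_freq.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[phrase] for phrase, weight in u.items() if phrase in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy collection: hypothetical phrases, not data from the paper.
docs = [
    ["suffix tree", "document clustering"],
    ["suffix tree", "web search"],
    ["pagerank"],
]
vectors = phrase_tfidf_vectors(docs)
sim_shared = cosine_similarity(vectors[0], vectors[1])    # share "suffix tree"
sim_disjoint = cosine_similarity(vectors[0], vectors[2])  # no phrase in common
```

In this sketch two documents are similar only to the extent that they share phrases, and each shared phrase counts according to its TF-IDF weight. The proposed method changes how those phrase weights are derived; the pairwise comparison step would stay conceptually the same.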


  1. H. Chim and X. Deng, "A New Suffix Tree Similarity Measure for Document Clustering," In Proceedings of the 16th International Conference on World Wide Web, pp.121-130, 2007.
  2. G. Salton and C. Buckley, "Term-Weighting Approaches in Automatic Text Retrieval," Information Processing and Management, Vol.24, No.5, pp.513-523, 1988.
  3. E. Ukkonen, "On-Line Construction of Suffix Trees," Algorithmica, Vol.14, No.3, pp.249-260, 1995.
  4. E. M. McCreight, "A Space-Economical Suffix Tree Construction Algorithm," Journal of the ACM, Vol.23, No.2, pp.262-272, 1976.
  5. O. Zamir and O. Etzioni, "Web Document Clustering: A Feasibility Demonstration," In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp.46-54, 1998.
  6. F. Gelgi, H. Davulcu, and S. Vadrevu, "Term Ranking for Clustering Web Search Results," In Proceedings of the 10th International Workshop on Web and Database, 2007.
  7. E. M. Voorhees, "Implementing Agglomerative Hierarchic Clustering Algorithms for Use in Document Retrieval," Information Processing and Management, Vol.22, No.6, pp.465-476, 1986.
  8. S. Brin and L. Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine," In Proceedings of the 7th International Conference on World Wide Web, pp.107-117, 1998.
  9. L. Page, S. Brin, R. Motwani, and T. Winograd, "The PageRank Citation Ranking: Bringing Order to the Web," Technical Report, Stanford Digital Library Technologies Project, 1998.
  10. W. Hersh, C. Buckley, T. J. Leone, and D. Hickam, "OHSUMED: An Interactive Retrieval Evaluation and New Large Test Collection for Research," In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp.192-201, 1994.
  11. D. D. Lewis, Y. Yang, T. G. Rose, and F. Li, "RCV1: A New Benchmark Collection for Text Categorization Research," Journal of Machine Learning Research, Vol.5, pp.361-397, 2004.
  12. M. Rosell, V. Kann, and J. E. Litton, "Comparing Comparisons: Document Clustering Evaluation Using Two Manual Classifications," In Proceedings of the 3rd International Conference on Natural Language Processing, 2004.