A Study on Keyword Extraction From a Single Document Using Term Clustering

용어 클러스터링을 이용한 단일문서 키워드 추출에 관한 연구

  • Seung-Hee Han (Department of Library and Information Science, College of Social Sciences, Seoul Women's University)
  • Received : 2010.07.19
  • Accepted : 2010.08.11
  • Published : 2010.08.30


This study applies a new keyword extraction algorithm, based on term clustering, to a single document. The document was divided into multiple passages, and two ways of calculating the similarity between two terms were investigated: first-order similarity and second-order distributional similarity. In the first experiment, the best clustering performance was achieved with 50-term passages and second-order distributional similarity. Based on this result, second-order distributional similarity was then applied to various keyword extraction methods that use statistical information about terms. In the second experiment, pf (paragraph frequency) and $tf \times ipf$ (term frequency by inverse paragraph frequency) were found to improve the overall performance of keyword extraction. These results indicate that the algorithm fulfills the necessary conditions that good keywords should have.
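The abstract does not give formulas, so the following Python sketch is only one plausible reading of the quantities it names, not the paper's actual method. It assumes cosine similarity over passage-occurrence vectors for first-order similarity, cosine similarity over first-order similarity profiles for second-order similarity, and $ipf = \log(N / pf)$ by analogy with idf, where $N$ is the number of passages. All function names are hypothetical, and passages stand in for paragraphs when counting pf.

```python
import math
from collections import Counter


def split_passages(tokens, size=50):
    """Cut a token list into consecutive passages of `size` terms."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def term_passage_vectors(passages):
    """Map each term to {passage index: count of the term in that passage}."""
    vectors = {}
    for idx, passage in enumerate(passages):
        for term, count in Counter(passage).items():
            vectors.setdefault(term, {})[idx] = count
    return vectors


def first_order(vectors):
    """Assumed first-order similarity: cosine of passage-occurrence vectors."""
    terms = sorted(vectors)
    return {a: {b: cosine(vectors[a], vectors[b]) for b in terms if b != a}
            for a in terms}


def second_order(first):
    """Assumed second-order similarity: cosine of first-order profiles."""
    terms = sorted(first)
    return {a: {b: cosine(first[a], first[b]) for b in terms if b != a}
            for a in terms}


def pf_and_tf_ipf(passages):
    """Return {term: (pf, tf * ipf)}, assuming ipf = log(N / pf)."""
    n = len(passages)
    tf = Counter(t for p in passages for t in p)
    pf = Counter(t for p in passages for t in set(p))
    return {t: (pf[t], tf[t] * math.log(n / pf[t])) for t in tf}


# Toy document; passages of 2 terms keep the example readable
# (the abstract reports 50-term passages as the best setting).
tokens = "cluster term keyword term index score".split()
passages = split_passages(tokens, size=2)
first = first_order(term_passage_vectors(passages))
second = second_order(first)
stats = pf_and_tf_ipf(passages)

# "cluster" and "keyword" never share a passage, but both co-occur with "term".
print(first["cluster"]["keyword"], second["cluster"]["keyword"])
print(stats["term"])
```

In this toy input, two terms that never appear in the same passage still receive a positive second-order score because they share a co-occurring neighbour, which is the kind of indirect relatedness a second-order measure is meant to capture.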


Term Clustering; Keyword Extraction; Single Document; Second-order Similarity; Text Mining


Supported by: Institute of Social Sciences, Seoul Women's University


