Representative Labels Selection Technique for Document Cluster using WordNet

A WordNet-Based Method for Selecting Representative Labels for Document Clusters

  • Kim, Tae-Hoon (Department of Industrial Engineering, Sungkyunkwan University) ;
  • Sohn, Mye (Department of Industrial Engineering, Sungkyunkwan University)
  • Received : 2016.10.10
  • Accepted : 2017.02.28
  • Published : 2017.04.30

Abstract

In this paper, we propose a document cluster labeling method that uses the information content of the words in a cluster to reveal what the cluster means. We first compute the frequency and information content of each word and combine these two measures into a weight that reflects how important the word is within the cluster. As a next step, we identify candidate labels using WordNet: each candidate label is the least common hypernym of words in the cluster. Finally, the representative labels are determined from the information content of the candidate labels and the weights of the words. To demonstrate the effectiveness of our method, we perform a heuristic experiment using two measures: the suitability of the candidate label ($Suitability_{cl}$) and the appropriacy of the representative label ($Appropriacy_{rl}$). With the proposed method, the suitability of the candidate labels decreases slightly compared with existing methods, but the computational cost is only about 20% of that of the conventional methods. We also confirmed that the appropriacy of the representative labels is better than with existing methods. As a result, the method is expected to help data analysts interpret document clusters more easily.

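This study proposes a document cluster labeling method that uses the information content of words to identify the meaning implied by each cluster produced by document clustering. To determine how important each word is within its cluster, we compute a weight for each word from its frequency of occurrence and its information content, and then use WordNet to identify the least common hypernyms of the words in the cluster as candidate labels. Using the information content of the candidate labels identified in this way and their importance weights within the cluster, we determine representative labels that comprehensively express the meaning and characteristics of the cluster. To demonstrate the merit of the method, we conducted the following experiment. We projected the labels selected by the proposed method and the candidate labels onto WordNet and checked the positions (depths) of these labels in WordNet. We also derived the number of words in the cluster that have each selected candidate label as a hypernym, and compared the labels selected by the heuristic method with the representative labels found by experts. As evaluation measures, we used the suitability of the candidate label ($Suitability_{cl}$) and the appropriacy of the representative label ($Appropriacy_{rl}$). The results show that, when document cluster labeling is performed with the proposed method, the suitability of the candidate labels decreases slightly compared with existing methods, but the amount of computation drops to about 20% of that of existing methods, and the appropriacy of the representative labels is better than with existing methods.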
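The pipeline described in the abstract (word weights from frequency and information content, candidate labels from least common hypernyms, representative label chosen by a score) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' formulation: the toy hypernym tree `HYPERNYM` stands in for WordNet, the counts in `CLUSTER_FREQ` are invented, information content is computed in the Resnik style as $-\log p(c)$, the word weight is taken as relative frequency times information content, and the candidate score is the candidate's information content times the summed weights of the words it covers.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical toy hypernym tree (child -> parent), a stand-in for WordNet.
HYPERNYM = {
    "dog": "canine", "wolf": "canine", "cat": "feline",
    "canine": "carnivore", "feline": "carnivore",
    "carnivore": "animal", "sparrow": "bird", "bird": "animal",
    "animal": "entity",
}

# Invented word frequencies for one document cluster.
CLUSTER_FREQ = {"dog": 5, "wolf": 3, "cat": 2, "sparrow": 1}


def path_to_root(concept):
    """Return [concept, parent, ..., root] in the toy tree."""
    path = [concept]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path


def information_content(freq):
    """Resnik-style IC: -log p(c), where a concept's count includes its hyponyms."""
    counts = Counter()
    for word, n in freq.items():
        for concept in path_to_root(word):
            counts[concept] += n
    total = sum(freq.values())
    return {c: -math.log(n / total) for c, n in counts.items()}


def word_weights(freq, ic):
    """Assumed weighting: relative frequency times the word's IC."""
    total = sum(freq.values())
    return {w: (n / total) * ic[w] for w, n in freq.items()}


def lowest_common_hypernym(a, b):
    """First concept on b's path to the root that is also on a's path."""
    ancestors_a = set(path_to_root(a))
    for concept in path_to_root(b):
        if concept in ancestors_a:
            return concept
    return None


def candidate_labels(words):
    """Candidate labels: lowest common hypernyms of all word pairs."""
    return {lowest_common_hypernym(a, b) for a, b in combinations(words, 2)}


def representative_label(freq):
    """Pick the candidate with the highest (assumed) score."""
    ic = information_content(freq)
    weights = word_weights(freq, ic)

    def score(label):
        covered = [w for w in freq if label in path_to_root(w)]
        return ic[label] * sum(weights[w] for w in covered)

    return max(candidate_labels(freq), key=score)


label = representative_label(CLUSTER_FREQ)
```

With these toy counts the candidates are `canine`, `carnivore`, and `animal`, and the sketch selects `canine`: very general hypernyms such as `animal` cover every word but carry almost no information content, so the assumed score penalizes them. A real implementation would query WordNet itself and would use the weighting and selection formulas defined in the full paper.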
