Improving the Retrieval Effectiveness by Incorporating Word Sense Disambiguation Process

A Study on Word Sense Disambiguation Models for Improving Information Retrieval Performance

  • Published : 2005.06.01

Abstract

This paper presents a semantic vector space retrieval model that incorporates a word sense disambiguation algorithm in an attempt to improve retrieval effectiveness. Nine Korean homonyms are selected for the sense disambiguation and retrieval experiments. The raw test collection comprises approximately 120,000 news articles, and 18 queries containing the homonyms as query words are used for the retrieval experiments. A Naive Bayes classifier and the EM algorithm, representing supervised and unsupervised learning respectively, are used for the disambiguation process. The Naive Bayes classifier achieved $92\%$ disambiguation accuracy, while the clustering performance of the EM algorithm was $67\%$ on average. The semantic vector space model incorporating the Naive Bayes classifier achieved $39.6\%$ precision, an improvement of about $7.4\%$ over the baseline retrieval without disambiguation. In contrast, the retrieval effectiveness of EM algorithm-based semantic retrieval was about $3\%$ lower than the baseline. Both disambiguation and retrieval performance depend on the distribution patterns of the homonyms to be disambiguated as well as on the characteristics of the queries.
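As an illustration of the supervised step described in the abstract, the following is a minimal sketch of a Naive Bayes sense classifier over context words. It is not the authors' implementation: the abstract does not specify features, smoothing, or training data, so the add-one smoothing, the function names `train_naive_bayes` and `disambiguate`, and the toy English training contexts for the homonym "bank" are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_contexts):
    """Count senses and context words from (sense, context_words) pairs."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for sense, words in labeled_contexts:
        sense_counts[sense] += 1
        word_counts[sense].update(words)
        vocab.update(words)
    return sense_counts, word_counts, vocab


def disambiguate(context_words, model):
    """Return the sense with the highest log posterior for the context."""
    sense_counts, word_counts, vocab = model
    total = sum(sense_counts.values())
    best_sense, best_score = None, float("-inf")
    for sense, n in sense_counts.items():
        score = math.log(n / total)  # log prior P(sense)
        # Add-one (Laplace) smoothing; an assumption, not from the paper.
        denom = sum(word_counts[sense].values()) + len(vocab)
        for w in context_words:
            if w in vocab:  # ignore words never seen in training
                score += math.log((word_counts[sense][w] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


# Hypothetical toy training data for the English homonym "bank".
TRAIN = [
    ("finance", ["deposit", "loan", "interest", "account"]),
    ("finance", ["loan", "money", "account"]),
    ("river", ["water", "fishing", "shore"]),
    ("river", ["river", "water", "mud"]),
]

model = train_naive_bayes(TRAIN)
print(disambiguate(["loan", "interest"], model))
print(disambiguate(["water", "shore"], model))
```

In a sense-aware vector space model, the predicted sense label could then be attached to the homonym before indexing, so that a query term matches only documents using the same sense. How the paper actually integrates the labels is not stated in the abstract.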

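This study applied a Naive Bayes classifier, a representative supervised learning model, and the EM algorithm, an unsupervised learning model, to disambiguate the subject terms that represent the content of documents and queries. Retrieval experiments were run with each model to evaluate whether disambiguating subject terms can improve retrieval performance. The test collection consisted of approximately 120,000 Korean newspaper articles, and nine Korean homonyms were selected as disambiguation targets. Eighteen queries containing the ambiguous words were used in the retrieval experiments. In the disambiguation experiments, the Naive Bayes classifier achieved an average accuracy of $92\%$ under optimal conditions, and the EM algorithm achieved an average clustering performance of about $67\%$ under optimal conditions. In semantic retrieval with an integrated disambiguation algorithm, retrieval using the Naive Bayes classifier showed a precision of about $39.6\%$, and retrieval using the EM algorithm showed a precision of about $36\%$. Compared with the $37\%$ precision of the baseline retrieval without a disambiguation model, Naive Bayes-integrated retrieval improved performance by about $7.4\%$, whereas EM-integrated retrieval degraded performance by about $3\%$.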
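The improvement and degradation rates are relative changes against the baseline precision, not differences in percentage points. The short sketch below recomputes them from the precision figures reported in the abstract (37% baseline, 39.6% with Naive Bayes, 36% with EM). The function name `relative_change` is introduced here for illustration only.

```python
def relative_change(baseline, system):
    """Relative change of `system` over `baseline`, in percent."""
    return (system - baseline) / baseline * 100.0


# Precision values (in percent) as reported in the abstract.
BASELINE = 37.0
NAIVE_BAYES = 39.6
EM = 36.0

nb_change = relative_change(BASELINE, NAIVE_BAYES)
em_change = relative_change(BASELINE, EM)

print(f"Naive Bayes vs. baseline: {nb_change:+.1f}%")
print(f"EM vs. baseline: {em_change:+.1f}%")
```

With these rounded inputs the computation gives roughly +7.0% and -2.7%. The second figure matches the reported "about 3%" decline. The gap between 7.0% and the reported 7.4% is presumably due to rounding in the published precision values, which is an assumption, since the abstract does not give unrounded figures.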

