A Study on Statistical Feature Selection with Supervised Learning for Word Sense Disambiguation

단어 중의성 해소를 위한 지도학습 방법의 통계적 자질선정에 관한 연구

  • Yong-Gu Lee (Department of Library and Information Science, Keimyung University)
  • Received : 2011.05.13
  • Accepted : 2011.06.10
  • Published : 2011.06.30

Abstract

This study aims to identify the statistical feature selection method and context window size that yield the best performance for supervised word sense disambiguation. Using a test collection of Korean newspaper articles, features were selected by four criteria: information gain, document frequency, chi-square, and a relevancy function. The results showed that, as in text categorization, selecting appropriate features can improve word sense disambiguation performance. Among the four criteria, information gain performed best. The SVM classifier was not affected by feature selection and performed better as the feature set and context window grew larger. The Naive Bayes classifier performed best at about 10 percent of the feature set size, and the kNN classifier at 10 percent or less. When feature selection is applied to word sense disambiguation, performance can be maximized by combining a small feature set with a large context window or, conversely, a large feature set with a small context window.
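
As a rough illustration of the feature selection step described in the abstract, the following Python sketch scores context words by document frequency, information gain, and chi-square, then keeps a fixed fraction of them. It is not the study's implementation. The function names, the toy English sentences for the ambiguous word "bank", and the window size are all hypothetical, and the formulas follow the usual text categorization definitions rather than anything stated in the abstract. The relevancy function is omitted because the abstract does not define it.

```python
import math
from collections import Counter


def context_window(tokens, target_index, size):
    """Return up to `size` tokens on each side of the target word."""
    left = tokens[max(0, target_index - size):target_index]
    right = tokens[target_index + 1:target_index + 1 + size]
    return left + right


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def score_features(instances):
    """Score each context word by DF, information gain, and chi-square.

    `instances` is a list of (set_of_context_words, sense_label) pairs.
    Chi-square is reported as the maximum over senses (an assumption).
    """
    n = len(instances)
    sense_counts = Counter(sense for _, sense in instances)
    df = Counter()     # word -> number of instances containing it
    joint = Counter()  # (word, sense) -> number of instances
    for words, sense in instances:
        for w in words:
            df[w] += 1
            joint[(w, sense)] += 1

    base_entropy = entropy(list(sense_counts.values()))
    scores = {}
    for w, present in df.items():
        absent = n - present
        with_w = [joint[(w, s)] for s in sense_counts]
        without_w = [sense_counts[s] - joint[(w, s)] for s in sense_counts]
        ig = (base_entropy
              - (present / n) * entropy(with_w)
              - (absent / n) * entropy(without_w))
        chi = 0.0
        for s, n_s in sense_counts.items():
            a = joint[(w, s)]   # word present, sense s
            b = present - a     # word present, other senses
            c = n_s - a         # word absent, sense s
            d = n - a - b - c   # word absent, other senses
            denom = (a + b) * (c + d) * (a + c) * (b + d)
            if denom:
                chi = max(chi, n * (a * d - b * c) ** 2 / denom)
        scores[w] = {"df": present, "ig": ig, "chi2": chi}
    return scores


def select_top(scores, criterion, fraction):
    """Keep the top `fraction` of features ranked by `criterion`."""
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(scores, key=lambda w: (-scores[w][criterion], w))
    return ranked[:k]


# Hypothetical toy data: (sentence, index of the target word, sense label).
TOY_DATA = [
    ("he deposited money at the bank yesterday", 5, "finance"),
    ("the bank approved the loan money", 1, "finance"),
    ("they fished from the river bank", 5, "river"),
    ("the river bank was muddy", 2, "river"),
]

WINDOW_SIZE = 2  # assumed; the study compares several context sizes
instances = [
    (set(context_window(text.split(), idx, WINDOW_SIZE)), sense)
    for text, idx, sense in TOY_DATA
]
scores = score_features(instances)
selected = select_top(scores, "ig", 0.10)  # keep roughly 10% of features
```

In this toy setting, a word such as "the" appears with every sense and so carries no information gain, while a word that occurs with only one sense ranks highest. Varying `WINDOW_SIZE` and `fraction` together mirrors the trade-off the abstract reports between context size and feature set size.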
