A Hybrid of Rule-based Method and Memory-based Learning for Korean Text Chunking

  • Published: 2004.03.01

Abstract

In partially free word order languages such as Korean and Japanese, a rule-based method is effective for text chunking: because these languages have well-developed overt postpositions and endings, a small set of rules can perform as well as various machine learning methods. However, a rule-based method has no way to handle exceptions to its rules. Exception handling is an important problem in natural language processing, and memory-based learning can deal with it effectively. In this paper, we propose a hybrid of a rule-based method and memory-based learning for Korean text chunking. The proposed method is primarily based on rules, and the chunks estimated by the rules are then verified by a memory-based classifier. In experiments on the Korean STEP 2000 corpus, the proposed method achieves a higher F-score than the rules alone or any of several machine learning methods alone. The final F-score is 94.19, whereas the rules alone and Support Vector Machines (SVMs), the best-performing machine learning method for this task, reach 91.87 and 92.54, respectively.
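The two-stage idea in the abstract (rules propose a chunk tag, a memory-based classifier verifies it) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy rule table, the POS labels (`noun`, `josa`, `eomi`, `verb`), the IOB-style chunk tags, the four-element feature tuple, the overlap distance, and the `threshold` parameter.

```python
from typing import List, Tuple

# Hypothetical feature layout: (previous POS, current POS, next POS, rule's tag).
Features = Tuple[str, str, str, str]

# Hypothetical toy rules mapping a POS tag to a chunk tag (not the paper's rules).
TOY_RULES = {"noun": "B-NP", "josa": "I-NP", "verb": "B-VP", "eomi": "I-VP"}


def rule_tag(pos: str) -> str:
    """Guess a chunk tag from the POS tag alone; 'O' when no rule fires."""
    return TOY_RULES.get(pos, "O")


def overlap_distance(a: Features, b: Features) -> int:
    """Count the feature positions where two instances disagree."""
    return sum(x != y for x, y in zip(a, b))


def verify(features: Features,
           memory: List[Tuple[Features, str]],
           threshold: int = 0) -> str:
    """Keep the rule's tag unless a stored exception lies within `threshold`."""
    guess = features[3]
    if not memory:
        return guess
    stored, tag = min(memory, key=lambda m: overlap_distance(features, m[0]))
    return tag if overlap_distance(features, stored) <= threshold else guess


def chunk(pos_tags: List[str], memory: List[Tuple[Features, str]]) -> List[str]:
    """Tag each token by rule first, then verify the tag against memory."""
    padded = ["<s>"] + pos_tags + ["</s>"]
    tags = []
    for i in range(1, len(padded) - 1):
        feats = (padded[i - 1], padded[i], padded[i + 1], rule_tag(padded[i]))
        tags.append(verify(feats, memory))
    return tags


# One stored exception: a noun that follows a noun continues the noun phrase.
memory = [(("noun", "noun", "josa", "B-NP"), "I-NP")]
sentence = ["noun", "noun", "josa", "verb", "eomi"]
print(chunk(sentence, []))      # rules alone
print(chunk(sentence, memory))  # rules verified against memory
```

With an empty memory the rules alone start a new noun phrase at the second noun; with the stored exception, that single decision is overridden and every other rule decision is left as is. A real memory-based learner would typically use weighted features and k nearest neighbours rather than this exact-match lookup.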
