Domain Adaptation Method for LHMM-based English Part-of-Speech Tagger

LHMM기반 영어 형태소 품사 태거의 도메인 적응 방법

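  • 권오욱 (Language Processing Research Team, Electronics and Telecommunications Research Institute) ;
  • 김영길 (Language Processing Research Team, Electronics and Telecommunications Research Institute)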
  • Received : 2010.08.06
  • Accepted : 2010.09.16
  • Published : 2010.10.15

Abstract

Many current language processing systems use a part-of-speech tagger as a preprocessor, and most of them require a tagger with the highest possible accuracy. In particular, exploiting domain-specific characteristics to improve translation quality has become an active topic in the machine translation community. This paper presents a method for customizing an HMM- or LHMM-based English tagger from the general domain to a specific domain. The proposed method semi-automatically customizes the output and transition probabilities of the HMM or LHMM using a domain-specific raw corpus. In experiments on customization to the patent domain, our LHMM tagger adapted by the proposed method achieves a word tagging accuracy of 98.87% and a sentence tagging accuracy of 78.5%. Compared with the general-domain tagger, the adapted tagger improves word tagging accuracy by 2.24 percentage points (error reduction rate, ERR: 66.4%) and sentence tagging accuracy by 41.0 percentage points (ERR: 65.6%).

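Part-of-speech taggers are widely used as preprocessors in language processing systems, and improving tagger performance can contribute substantially to the overall performance of such systems. For highly complex language processing systems such as machine translation, recent work aims to build systems that perform well in a specific domain. This paper proposes a method for adapting an LHMM- or HMM-based English part-of-speech tagger trained on the general domain to a specific domain so that it achieves high performance there. The proposed method is a domain adaptation method that uses a raw corpus from the target domain to semi-automatically modify the pre-trained transition and output probabilities of the HMM or LHMM to suit that domain. In experiments on adaptation to the patent domain, the tagger achieved a word-level tagging accuracy of 98.87% and a sentence-level tagging accuracy of 78.5%. Compared with the tagger without domain adaptation, word-level tagging accuracy improved by 2.24 percentage points (ERR: 66.4%) and sentence-level tagging accuracy by 41.0 percentage points (ERR: 65.6%).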
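
The reported figures can be cross-checked with a few lines of arithmetic. The sketch below assumes that ERR denotes the relative reduction in error rate and that the stated improvements (2.24 and 41.0) are absolute percentage-point gains over the general-domain tagger. Neither assumption is spelled out in the abstract, and the function name `error_reduction_rate` is purely illustrative.

```python
def error_reduction_rate(baseline_acc: float, adapted_acc: float) -> float:
    """Relative error reduction in percent; both inputs are accuracies in percent.

    Assumed definition: (baseline error - adapted error) / baseline error.
    """
    baseline_err = 100.0 - baseline_acc
    adapted_err = 100.0 - adapted_acc
    return 100.0 * (baseline_err - adapted_err) / baseline_err


# Figures taken from the abstract (accuracy of the adapted tagger, absolute gain).
word_adapted, word_gain = 98.87, 2.24
sent_adapted, sent_gain = 78.5, 41.0

# Baseline accuracies are inferred by subtracting the gain (an assumption).
word_err = error_reduction_rate(word_adapted - word_gain, word_adapted)
sent_err = error_reduction_rate(sent_adapted - sent_gain, sent_adapted)

print(f"word-level ERR:     {word_err:.2f}%")  # close to the reported 66.4%
print(f"sentence-level ERR: {sent_err:.2f}%")  # close to the reported 65.6%
```

Under these assumptions the inferred baseline accuracies are 96.63% at the word level and 37.5% at the sentence level, and both computed rates agree with the reported ERR values to within rounding.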
