Detecting and correcting errors in Korean POS-tagged corpora

Choi, Myung-Gil;Seo, Hyung-Won;Kwon, Hong-Seok;Kim, Jae-Hoon;

doi:10.5916/jkosme.2013.37.2.227

Journal of Advanced Marine Engineering and Technology

Vol. 37, No. 2 / pp. 227-235 / 2013 / 2234-7925 (pISSN) / 2234-8352 (eISSN)

한국마린엔지니어링학회 (The Korean Society of Marine Engineering)


한국어 품사 부착 말뭉치의 오류 검출 및 수정


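Myung-Gil Choi (Kumho Marine Tech);
Hyung-Won Seo (Department of Computer Engineering, Korea Maritime University);
Hong-Seok Kwon (Department of Computer Engineering, Korea Maritime University);
Jae-Hoon Kim (Division of IT Engineering, Korea Maritime University)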

Received: 2013.02.05
Reviewed: 2013.02.28
Published: 2013.03.31


Abstract

The quality of part-of-speech (POS) annotation in a corpus plays an important role in developing POS taggers. However, many POS-tagged corpora built in Korea, including the Sejong Corpus, still contain various kinds of errors: not only annotation errors but also spelling errors and the insertion or deletion of unexpected characters. In this paper, we propose a method for detecting annotation errors using error patterns and develop a tool for correcting them efficiently. Using the developed tool, we hand-corrected annotation errors in the Sejong POS-tagged corpus; correction was on average at least nine times faster than without any tool. We therefore conclude that the proposed method is effective for correcting annotation errors in POS-tagged corpora.
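The abstract does not list the error patterns themselves, so the following is only a minimal sketch of what pattern-based error detection over Sejong-style annotations might look like. The `morph/TAG+morph/TAG` input format, the tag inventory, the function names, and the three patterns (malformed unit, unknown tag, digit string not tagged `SN`) are all assumptions made for illustration; they are not the patterns used in the paper.

```python
import re

# Assumed Sejong-style tag inventory, for illustration only;
# verify against the corpus guideline before relying on it.
SEJONG_TAGS = {
    "NNG", "NNP", "NNB", "NR", "NP", "VV", "VA", "VX", "VCP", "VCN",
    "MM", "MAG", "MAJ", "IC", "JKS", "JKC", "JKG", "JKO", "JKB", "JKV",
    "JKQ", "JX", "JC", "EP", "EF", "EC", "ETN", "ETM", "XPN", "XSN",
    "XSV", "XSA", "XR", "SF", "SP", "SS", "SE", "SO", "SW", "SL", "SH",
    "SN", "NF", "NV", "NA",
}


def parse_annotation(annotation):
    """Split 'morph/TAG+morph/TAG' into (morph, tag) pairs.

    A unit without '/' yields (unit, None). Simplification: a literal
    '+' morpheme is not handled by this sketch.
    """
    pairs = []
    for unit in annotation.split("+"):
        morph, sep, tag = unit.rpartition("/")
        pairs.append((morph, tag) if sep else (unit, None))
    return pairs


def detect_errors(annotation):
    """Return a list of (pattern_name, morph, tag) for suspicious units."""
    errors = []
    for morph, tag in parse_annotation(annotation):
        if tag is None or morph == "":
            # Hypothetical pattern 1: missing tag or empty morpheme.
            errors.append(("malformed", morph, tag))
        elif tag not in SEJONG_TAGS:
            # Hypothetical pattern 2: tag outside the assumed inventory.
            errors.append(("unknown_tag", morph, tag))
        elif re.fullmatch(r"[0-9]+", morph) and tag != "SN":
            # Hypothetical pattern 3: digit string not tagged as a number.
            errors.append(("digit_not_SN", morph, tag))
    return errors


if __name__ == "__main__":
    for line in ["나/NP+는/JX", "학교/NNGG+에/JKB", "2013/NNG+년/NNB"]:
        print(line, detect_errors(line))
```

A detector of this kind only flags candidates. Consistent with the abstract, the flagged units would still be reviewed and corrected by hand in an editing tool.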



