A Joint Statistical Model for Word Spacing and Spelling Error Correction Simultaneously

Noh, Hyung-Jong; Cha, Jeong-Won; Lee, Gary Geun-Bae

Journal of KIISE:Software and Applications (한국정보과학회논문지:소프트웨어및응용)

Volume 34, Issue 2 / Pages 131-139 / 2007 / pISSN 1229-6848

Korean Institute of Information Scientists and Engineers (한국정보과학회)


띄어쓰기 및 철자 오류 동시교정을 위한 통계적 모델

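Noh, Hyung-Jong (Department of Computer Engineering, Pohang University of Science and Technology);
Cha, Jeong-Won (Department of Computer Engineering, Changwon National University);
Lee, Gary Geun-Bae (Department of Computer Engineering, Pohang University of Science and Technology)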

Published : 2007.02.15


Abstract

In this paper, we present a preprocessor that corrects word spacing errors and spelling errors simultaneously. Existing preprocessing algorithms are limited because they correct each type of error separately; the proposed algorithm extends the noisy-channel model so that it corrects both kinds of error in colloquial-style sentences effectively. Using statistical data such as n-gram and Jaso (Korean letter) transition probabilities together with an Eojeol (spacing-unit word) transition pattern dictionary, it keeps dictionary usage to a minimum while still generating correction candidates effectively. Although the experiments did not yield satisfactory performance at the current stage, error analysis showed that the methodology is useful in practice. We expect that, with further improvement, the preprocessor will function as an effective error corrector for general colloquial-style sentences.
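The abstract names two ingredients, Jaso-level statistics and a noisy-channel formulation, without giving the model itself. The sketch below is only an illustration under stated assumptions, not the authors' implementation: `to_jaso` uses the standard Unicode arithmetic for decomposing a Hangul syllable into its letters, and `correct` picks the candidate maximizing log P(candidate) + log P(observed | candidate). The function names and all probability values are hypothetical toy numbers chosen for the example.

```python
import math

# Unicode Hangul syllable block layout (standard Unicode arithmetic).
HANGUL_BASE = 0xAC00
NUM_SYLLABLES = 11172
NUM_MEDIALS = 21
NUM_FINALS = 28
INITIALS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
MEDIALS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
FINALS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")


def to_jaso(ch):
    """Split one Hangul syllable into (initial, medial, final) letters.

    Non-Hangul characters are returned unchanged as a 1-tuple.
    """
    code = ord(ch) - HANGUL_BASE
    if not 0 <= code < NUM_SYLLABLES:
        return (ch,)
    initial, rest = divmod(code, NUM_MEDIALS * NUM_FINALS)
    medial, final = divmod(rest, NUM_FINALS)
    return (INITIALS[initial], MEDIALS[medial], FINALS[final])


def correct(observed, candidates, lm_logprob, channel_logprob):
    """Noisy-channel choice: argmax of log P(c) + log P(observed | c).

    Candidates may differ from the input in spacing, spelling, or both,
    so a single score ranks both kinds of correction together.
    """
    return max(
        candidates,
        key=lambda c: lm_logprob[c] + channel_logprob[(observed, c)],
    )


# Hypothetical toy probabilities, for illustration only.
observed = "날씨조아"
candidates = ["날씨조아", "날씨 좋아"]
lm_logprob = {
    "날씨조아": math.log(0.05),
    "날씨 좋아": math.log(0.95),
}
channel_logprob = {
    ("날씨조아", "날씨조아"): math.log(0.6),
    ("날씨조아", "날씨 좋아"): math.log(0.4),
}

best = correct(observed, candidates, lm_logprob, channel_logprob)
```

In this toy setup the misspelled syllable 조 and the intended 좋 differ only in the final letter, which is the kind of difference a Jaso-level channel model can score; a real system would estimate both tables from corpus data rather than hard-coding them.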


