A Joint Statistical Model for Word Spacing and Spelling Error Correction Simultaneously

A Statistical Model for Simultaneous Correction of Word Spacing and Spelling Errors (translated from the Korean title)

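  • Hyungjong Noh (노형종), Department of Computer Engineering, Pohang University of Science and Technology
  • Jeong-Won Cha (차정원), Department of Computer Engineering, Changwon National University
  • Geunbae Lee (이근배), Department of Computer Engineering, Pohang University of Science and Technology
  • Published: 2007.02.15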

Abstract

In this paper, we present a preprocessor that corrects word spacing errors and spelling errors simultaneously. The proposed model extends the noisy-channel model so that it corrects both kinds of error in colloquial-style sentences effectively, whereas existing preprocessing algorithms are limited because they correct each kind of error separately. Using an Eojeol transition pattern dictionary together with statistical data such as n-gram and Jaso transition probabilities, it minimizes reliance on dictionaries and produces correction candidates effectively. Although the experiments did not yield satisfactory results at the current stage, error analysis indicated that the proposed methodology is useful. We therefore expect that, with further improvements, the preprocessor will serve as an effective error corrector for general colloquial-style sentences.
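To make the idea of joint correction under a noisy-channel model concrete, here is a minimal sketch. It is not the model described in the paper: it assumes a toy English word unigram model (`WORD_PROB`) in place of the Eojeol transition pattern dictionary and n-gram statistics, and a uniform character-substitution channel (`P_KEEP`, `P_SUB`) in place of Jaso transition probabilities. All names and probability values are hypothetical. The function `correct` discards the observed spacing and uses Viterbi search to pick word boundaries and spellings together, maximizing log P(words) + log P(observed | words).

```python
import math

# Toy unigram language model P(word); every value here is a made-up assumption.
WORD_PROB = {"the": 0.4, "cat": 0.2, "sat": 0.2, "hat": 0.1, "at": 0.1}

P_KEEP = 0.9       # assumed probability that a character is typed correctly
P_SUB = 0.1 / 25   # assumed probability of one specific wrong letter


def channel_logprob(observed, word):
    """log P(observed | word) under a substitution-only character channel."""
    if len(observed) != len(word):
        return -math.inf
    return sum(math.log(P_KEEP if o == w else P_SUB)
               for o, w in zip(observed, word))


def correct(noisy):
    """Jointly choose word boundaries and spellings with Viterbi search."""
    chars = noisy.replace(" ", "")  # observed spacing is treated as unreliable
    n = len(chars)
    # best[i] = (best log score for chars[:i], backpointer (start, word))
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for word, prob in WORD_PROB.items():
            start = end - len(word)
            if start < 0 or best[start][0] == -math.inf:
                continue
            score = (best[start][0] + math.log(prob)
                     + channel_logprob(chars[start:end], word))
            if score > best[end][0]:
                best[end] = (score, (start, word))
    if best[n][0] == -math.inf:
        return None  # no sequence of vocabulary words covers the input
    words = []
    pos = n
    while pos > 0:
        start, word = best[pos][1]
        words.append(word)
        pos = start
    return " ".join(reversed(words))


print(correct("thec atsst"))
```

Because spacing and spelling are scored in a single search, a misplaced space and a substituted letter in the same input are resolved together rather than in two separate passes, which is the limitation of separate preprocessing that the abstract points to.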
