Improved Statistical Language Model for Context-sensitive Spelling Error Candidates

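Improving the Statistical Language Model for Generating Context-Sensitive Spelling Error Candidates (translation of the Korean title)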

  • Lee, Jung-Hun (Dept. of Electrical and Computer Eng., Graduate School, Pusan National University)
  • Kim, Minho (Dept. of Electrical and Computer Eng., Graduate School, Pusan National University)
  • Kwon, Hyuk-Chul (Dept. of Information Computer Science, College of Eng., Pusan National University)
  • Received : 2017.01.13
  • Accepted : 2017.02.06
  • Published : 2017.02.28

Abstract

The performance of statistical context-sensitive spelling error correction depends on the quality and quantity of the data behind the statistical language model. In general, the quality of a statistical language model grows with the amount of data it is built from. However, as the amount of data increases, processing becomes slower and more storage space is required. To address this problem, we propose an improved statistical language model, together with an effective method for generating spelling error candidates based on the new model. The proposed model and the correction method built on it improve both spelling error correction performance and processing speed.
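
The abstract does not describe the proposed model itself, so the following is only a generic sketch of how a statistical language model can rank context-sensitive spelling candidates, not the authors' method. It assumes a toy corpus, a hand-written confusion set, and an add-one-smoothed bigram model; `CORPUS`, `CONFUSION_SETS`, `train_bigrams`, `bigram_prob`, `candidates`, and `correct` are all hypothetical names introduced for illustration.

```python
from collections import Counter

# Toy data, hypothetical and for illustration only (not from the paper).
CORPUS = [
    "a piece of cake",
    "a piece of paper",
    "peace and quiet",
    "world peace is possible",
]
# Words that are easily confused with each other (assumed, hand-written).
CONFUSION_SETS = [{"peace", "piece"}]


def train_bigrams(sentences):
    """Count unigrams and bigrams, with sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_prob(prev, word, unigrams, bigrams):
    """P(word | prev) with add-one smoothing over the observed vocabulary."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)


def candidates(word):
    """Return the confusion set containing `word`, or the word alone."""
    for confusion_set in CONFUSION_SETS:
        if word in confusion_set:
            return sorted(confusion_set)
    return [word]


def correct(sentence, unigrams, bigrams):
    """Replace each word by the candidate that best fits its left and right context."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    result = []
    for i in range(1, len(tokens) - 1):
        prev_word, next_word = tokens[i - 1], tokens[i + 1]
        best = max(
            candidates(tokens[i]),
            key=lambda c: bigram_prob(prev_word, c, unigrams, bigrams)
            * bigram_prob(c, next_word, unigrams, bigrams),
        )
        result.append(best)
    return " ".join(result)


if __name__ == "__main__":
    unigrams, bigrams = train_bigrams(CORPUS)
    print(correct("a peace of cake", unigrams, bigrams))
    print(correct("world piece is possible", unigrams, bigrams))
```

In this sketch the count tables grow with the corpus, which is the storage and speed trade-off the abstract refers to; how the proposed model reduces that cost is not stated here.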

Cited by

  1. "Utilizing Local Document Information to Improve Statistical Context-Sensitive Spelling Error Correction," vol.23, pp.7, 2017, https://doi.org/10.5626/ktcp.2017.23.7.446