Transformer-based reranking for improving Korean morphological analysis systems

  • Jihee Ryu (Language Intelligence Research Section, Electronics and Telecommunications Research Institute) ;
  • Soojong Lim (Language Intelligence Research Section, Electronics and Telecommunications Research Institute) ;
  • Oh-Woog Kwon (Language Intelligence Research Section, Electronics and Telecommunications Research Institute) ;
  • Seung-Hoon Na (Division of Computer Science and Engineering, Jeonbuk National University)
  • Received: 2023.08.29
  • Accepted: 2023.12.22
  • Published: 2024.02.20

Abstract

This study introduces a new approach to Korean morphological analysis that combines dictionary-based techniques with Transformer-based deep learning models. The key innovation is a BERT-based reranking system that significantly enhances the accuracy of traditional morphological analysis. The method first generates multiple suboptimal analysis paths and then employs BERT models to rerank them, leveraging their advanced language comprehension. Results show substantial performance gains: the first-stage reranking achieves an error reduction rate of over 20% compared with existing models, and the second stage, which uses another BERT variant, raises this figure to over 30%. These results represent a significant leap in accuracy and validate the effectiveness of merging dictionary-based analysis with contemporary deep learning. The study suggests future work on more refined integration of dictionary-based and deep learning methods, as well as the use of probabilistic models, to further enhance morphological analysis. This hybrid approach sets a new benchmark in the field and offers insights for similar challenges in other language processing applications.
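The reranking idea summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a dictionary-based analyzer has already produced N-best morpheme/POS candidate paths for a sentence, and it scores each (sentence, candidate) pair with a BERT cross-encoder from the Hugging Face transformers library. The model name, scoring head, and example candidates are placeholders.

```python
# Minimal sketch of BERT-based N-best reranking for morphological analysis.
# Assumptions (not from the paper): a dictionary-based analyzer has already
# produced candidate morpheme/POS paths; a sequence-classification head with a
# single logit serves as the ranking score and would need fine-tuning on
# (sentence, gold-path) pairs before its scores are meaningful.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "bert-base-multilingual-cased"  # placeholder; the paper uses Korean BERT variants

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=1)
model.eval()

def rerank(sentence, candidates):
    """Score each candidate analysis against the raw sentence and sort best-first."""
    scored = []
    with torch.no_grad():
        for cand in candidates:
            # Encode the sentence and the serialized morpheme/POS path as a sentence pair.
            inputs = tokenizer(sentence, cand, return_tensors="pt", truncation=True)
            score = model(**inputs).logits.squeeze().item()
            scored.append((cand, score))
    return sorted(scored, key=lambda x: x[1], reverse=True)

# Hypothetical N-best candidates for the sentence "나는 학교에 간다".
nbest = [
    "나/NP + 는/JX 학교/NNG + 에/JKB 가/VV + ㄴ다/EF",
    "날/VV + 는/ETM 학교/NNG + 에/JKB 가/VV + ㄴ다/EF",
]
print(rerank("나는 학교에 간다", nbest))
```

In the two-stage setup described in the abstract, a second reranker based on a different BERT variant would rescore the top candidates from the first stage in the same manner.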

Keywords

Acknowledgements

This work was supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (RS-2023-00216011, Development of artificial complex intelligence for conceptually understanding and inferring like human). We would like to thank Editage (www.editage.co.kr) and Soomgo (soomgo.com) for English language editing.
