Syllable-Level Lightweight Korean POS Tagger using Transformer Encoder

  • 민수영 (Department of AI Systems Engineering, Sungkyunkwan University);
  • 고영중 (Department of Software, Sungkyunkwan University)
  • Received : 2024.08.16
  • Accepted : 2024.09.02
  • Published : 2024.10.31

Abstract

Morphological analysis segments text into morphemes, the smallest units of a language that carry meaning or grammatical function, and assigns a part-of-speech tag to each one. It plays a critical role in natural language processing tasks such as named entity recognition and dependency parsing. Much of modern natural language processing relies on deep learning-based language models, and deep learning-based Korean morphological analysis falls broadly into two groups: sequence-to-sequence methods and sequence labeling methods. This study proposes a sequence labeling approach in which a Transformer encoder performs syllable-level part-of-speech tagging, after which a pre-analyzed dictionary is used to restore the morphemes and assign their tags. In addition, syllable-level embeddings are extracted in a low dimension with the CBOW method, yielding a lightweight morphological analyzer with fewer parameters. With its small parameter count and fast inference, the proposed model can be used efficiently in resource-constrained environments.
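
The second stage described in the abstract, restoring morphemes from syllable-level tags with a pre-analyzed dictionary, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the BIO-style tag format, the compound tag `VV+EP`, the dictionary contents, and the function names `group_syllables` and `restore`. The per-syllable tags are written by hand, standing in for the output of the Transformer encoder.

```python
from typing import Dict, List, Tuple

Morpheme = Tuple[str, str]  # (form, POS tag)

# Hypothetical pre-analyzed dictionary (illustrative entry only):
# (surface chunk, compound tag) -> restored morpheme sequence.
PRE_ANALYZED: Dict[Tuple[str, str], List[Morpheme]] = {
    ("갔", "VV+EP"): [("가", "VV"), ("았", "EP")],
}


def group_syllables(syllables: List[str], tags: List[str]) -> List[Morpheme]:
    """Merge consecutive syllables whose BIO-style tags belong to one chunk."""
    chunks: List[Morpheme] = []
    for syllable, tag in zip(syllables, tags):
        prefix, pos = tag.split("-", 1)
        if prefix == "I" and chunks and chunks[-1][1] == pos:
            # Continuation of the previous chunk: append the syllable.
            chunks[-1] = (chunks[-1][0] + syllable, pos)
        else:
            # "B" tag (or a stray "I" tag): start a new chunk.
            chunks.append((syllable, pos))
    return chunks


def restore(chunks: List[Morpheme]) -> List[Morpheme]:
    """Expand chunks found in the dictionary; keep all others unchanged."""
    morphemes: List[Morpheme] = []
    for surface, pos in chunks:
        morphemes.extend(PRE_ANALYZED.get((surface, pos), [(surface, pos)]))
    return morphemes


# Hand-written tags standing in for the encoder's per-syllable predictions.
syllables = list("학교에갔다")
tags = ["B-NNG", "I-NNG", "B-JKB", "B-VV+EP", "B-EF"]
print(restore(group_syllables(syllables, tags)))
```

In this sketch the syllable 갔 carries two morphemes at once, so the tagger is assumed to emit a single compound tag for it, and the dictionary lookup recovers the underlying forms 가/VV and 았/EP. Chunks without a dictionary entry pass through unchanged.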

Acknowledgement

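This work was supported by a National Research Foundation of Korea (NRF) grant funded by the Korea government (Ministry of Science and ICT, MSIT) (No. RS-2024-00350379, 40), by the Generative AI Leading Talent Development Program of MSIT and the Institute of Information & Communications Technology Planning & Evaluation (IITP) (IITP-2024-RS-2024-00360227, 40), and by the ICT Creative Consilience Program supported by IITP and funded by the Korea government (MSIT) (RS-2020-II201821, 20).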
