Construction of a Korean Traditional Culture Corpus and Development of a Named Entity Recognition Model Using Bi-LSTM-CNN-CRF

  • Kim, GyeongMin (Department of Computer Science and Engineering, Korea University)
  • Kim, Kuekyeng (Department of Computer Science and Engineering, Korea University)
  • Jo, Jaechoon (Department of Computer Science and Engineering, Korea University)
  • Lim, HeuiSeok (Department of Computer Science and Engineering, Korea University)
  • Received : 2018.09.30
  • Accepted : 2018.11.20
  • Published : 2018.12.28

Abstract

A named entity recognition (NER) system extracts named entities that can carry a unique meaning, such as persons (PS), locations (LC), and organizations (OG), from a document and determines the category of each extracted entity. In recent deep-learning research on NER, the Bi-LSTM-CRF model has shown excellent performance: it combines a Bi-LSTM, an LSTM-based model that reads the input in both the forward and backward directions, with a CRF that uses the transition probabilities between output labels. Studies on efficiently generating embedding vectors at the character and word level, and models that use CNNs and LSTMs, have also performed well. In this study, we describe a Bi-LSTM-CNN-CRF model with augmented features for improving the performance of Korean named entity recognition, and we propose a method for constructing a traditional culture corpus. We also present the results of training this feature-augmented Bi-LSTM-CNN-CRF model on the constructed corpus.
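To make the role of the CRF layer concrete, the following is a minimal sketch of Viterbi decoding over per-token scores plus tag-to-tag transition scores. It is an illustration under stated assumptions, not the authors' implementation: the BIO prefixes (`B-`, `I-`) on the PS/LC/OG categories, the function names `viterbi_decode` and `bio_transitions`, and every score below are hypothetical. In a real model the per-token scores would come from the Bi-LSTM and the transition scores would be learned.

```python
from typing import Dict, List, Tuple

# Assumed tag set: BIO prefixes on the PS/LC/OG categories named in the abstract.
TAGS = ["O", "B-PS", "I-PS", "B-LC", "I-LC", "B-OG", "I-OG"]


def bio_transitions(tags: List[str], penalty: float = -10.0) -> Dict[Tuple[str, str], float]:
    """Hand-written stand-in for learned CRF transitions:
    penalize I-X unless it follows B-X or I-X of the same category."""
    trans = {}
    for prev in tags:
        for cur in tags:
            if cur.startswith("I-") and prev[2:] != cur[2:]:
                trans[(prev, cur)] = penalty
    return trans


def viterbi_decode(
    emissions: List[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
    tags: List[str] = TAGS,
) -> List[str]:
    """Return the tag path maximizing emission scores plus transition scores."""
    if not emissions:
        return []
    # best[t][tag] = score of the best path that ends in `tag` at position t
    best = [{tag: emissions[0].get(tag, 0.0) for tag in tags}]
    back: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in tags:
            prev_best = max(
                tags,
                key=lambda prev: best[t - 1][prev] + transitions.get((prev, cur), 0.0),
            )
            scores[cur] = (
                best[t - 1][prev_best]
                + transitions.get((prev_best, cur), 0.0)
                + emissions[t].get(cur, 0.0)
            )
            pointers[cur] = prev_best
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical scores for three tokens (person, verb, location).
emissions = [
    {"B-PS": 2.0, "O": 0.5},
    {"O": 1.5, "I-LC": 1.6},  # a per-token argmax would pick the invalid I-LC
    {"B-LC": 2.0, "O": 0.3},
]
transitions = bio_transitions(TAGS)

greedy = [max(TAGS, key=lambda tag: e.get(tag, 0.0)) for e in emissions]
path = viterbi_decode(emissions, transitions)
print(greedy)  # ['B-PS', 'I-LC', 'B-LC']
print(path)    # ['B-PS', 'O', 'B-LC']
```

Choosing the best tag per token independently yields `I-LC` directly after `B-PS`, which is not a valid BIO sequence. Scoring whole paths with transition terms steers the decoder to `O` instead, which is the kind of label-dependency modeling a CRF output layer adds on top of a Bi-LSTM.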

Fig. 1. System Architecture of Named Entity Recognition Using Bi-LSTM-CNN-CRF

Fig. 2. CNN-extracted char features

Fig. 3. Bi-LSTM-CRF

Fig. 4. Extract data and Create morpheme unit

Fig. 5. Results using Traditional culture Corpus

Fig. 6. Results using Traditional culture Corpus2

Fig. 7. Incorrect tagging by the NER system

Table 1. Category and tag ratios in the total corpus
