BERT-based Classification Model for Korean Documents

BERT-Based Classification Model for Analyzing Korean Technical Documents

  • Hwang, Sangheum (Department of Industrial & Information Systems Engineering, Seoul National University of Science and Technology)
  • Kim, Dohyun (Department of Industrial and Management Engineering, Myongji University)
  • Received : 2020.01.10
  • Accepted : 2020.02.25
  • Published : 2020.02.28

Abstract

Classifying technical documents such as patents and R&D project reports is necessary for understanding trends in technology convergence, interdisciplinary joint research, and technology development. Text mining techniques have mainly been used to classify these documents. However, text mining algorithms have the disadvantage that the features representing each document must be extracted by hand. In this study, we propose a BERT-based document classification model that automatically extracts document features from the text of national R&D projects and classifies the projects. We then verify the applicability and performance of the proposed model for document classification.

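Classification information for technical documents such as R&D project records and patents has recently been widely used to understand the state of technology development, the emergence of new technology fields, technology convergence and interdisciplinary joint research, and shifts in technology trends. Text mining techniques have mainly been used to classify such documents. However, classifying technical documents with existing text mining methods has the limitation that the features representing the documents must be extracted by hand. This study therefore proposes a model that uses the deep-learning-based BERT model to extract document features automatically from technical documents and apply them directly to document classification, and it verifies the model's performance. To this end, we use text-based national R&D project information to build a BERT-based model that predicts the mid-level classification code of national R&D projects, and we evaluate its performance.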
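
As a rough illustration of the final classification step described in the abstract, the sketch below applies a linear layer and a softmax to a document feature vector. It is not the authors' implementation: the names `softmax` and `classify`, the toy numbers, and the two-class setup are all hypothetical. It assumes the common arrangement in which an encoder such as BERT supplies a pooled document vector, and the list `features` merely stands in for that vector.

```python
import math


def softmax(logits):
    """Turn raw class scores into probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(features, weights, biases):
    """Apply a linear layer and softmax to a document feature vector.

    features: document vector (hypothetical stand-in for an encoder's pooled output)
    weights:  one row of coefficients per class
    biases:   one bias term per class
    Returns (index of the most probable class, list of class probabilities).
    """
    logits = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


# Made-up numbers for illustration: a 3-dimensional feature vector, 2 classes.
features = [0.5, -1.0, 2.0]
weights = [
    [1.0, 0.0, 0.5],   # hypothetical class 0
    [-1.0, 1.0, 0.0],  # hypothetical class 1
]
biases = [0.0, 0.1]

label, probs = classify(features, weights, biases)
print(label, probs)
```

In a trained model the weights and biases would be learned together with the encoder, and there would be one row per classification code rather than two toy classes.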
