An Automated Industry and Occupation Coding System Using Deep Learning

  • Lim, Jungwoo (Department of Computer Science and Engineering, Korea University)
  • Moon, Hyeonseok (Department of Computer Science and Engineering, Korea University)
  • Lee, Chanhee (Department of Computer Science and Engineering, Korea University)
  • Woo, Chankyun (Survey System Management Division, Statistics Korea)
  • Lim, Heuiseok (Department of Computer Science and Engineering, Korea University)

  • Received: 2021.02.05
  • Accepted: 2021.04.20
  • Published: 2021.04.28

Abstract

An automated industry and occupation coding system assigns statistical classification codes to the large volume of natural-language responses in which survey respondents describe their industry and occupation. Unlike previous studies based on information retrieval, we propose a deep learning system that needs no index database and assigns appropriate codes regardless of the classification level. We also show that our model, built on KoBERT, a model that performs well on downstream natural language processing tasks, outperforms the baseline. Our method achieves top-10 accuracies of 95.65%, 91.51%, and 97.66% on occupation and industry code classification for the Population and Housing Census and on industry code classification for the Census on Basic Characteristics of Establishments, respectively. Finally, we use error analysis to identify possible future improvements from the data and modeling perspectives.

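This automated industry/occupation coding system automatically assigns statistical classification codes to the vast amount of natural-language data in which survey respondents describe their industry and occupation. Unlike existing information-retrieval-based automated industry/occupation coding systems, this study uses deep learning to propose a system that requires no index database and can assign codes regardless of the classification level. In addition, the proposed model, which applies KoBERT, a deep learning technique specialized for natural language processing, achieves Top 10 accuracies of 95.65%, 91.45%, and 97.66% on industry/occupation code classification for the Population and Housing Census and on industry code classification for the Census on Basic Characteristics of Establishments, respectively. After the experiments with the proposed model, we analyze possibilities for future improvement from the data and modeling perspectives.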
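
The accuracies above are reported as Top 10 accuracy: a response counts as correct when its true code is among the ten codes the model scores highest. The following is a minimal Python sketch of that metric, not the authors' evaluation code. The function name `top_k_accuracy`, the dictionary-of-scores input format, and the example codes are all assumptions made for illustration.

```python
from typing import Dict, Sequence


def top_k_accuracy(
    scores: Sequence[Dict[str, float]],
    gold: Sequence[str],
    k: int = 10,
) -> float:
    """Fraction of responses whose gold code is among the k highest-scoring codes.

    scores[i] maps each candidate classification code to the model's score
    for response i (hypothetical format); gold[i] is the correct code.
    """
    if len(scores) != len(gold):
        raise ValueError("scores and gold must have the same length")
    if not gold:
        raise ValueError("at least one example is required")

    hits = 0
    for row, label in zip(scores, gold):
        # Rank candidate codes from highest to lowest score and keep the top k.
        top_k = sorted(row, key=row.get, reverse=True)[:k]
        if label in top_k:
            hits += 1
    return hits / len(gold)


# Hypothetical scores for two responses over three made-up codes.
example_scores = [
    {"A01": 0.7, "B02": 0.2, "C03": 0.1},
    {"A01": 0.5, "B02": 0.1, "C03": 0.4},
]
example_gold = ["B02", "B02"]

print(top_k_accuracy(example_scores, example_gold, k=2))
```

With `k=2`, only the first response has its gold code among the two highest-scoring candidates, so the call evaluates to half of the examples. Raising `k` can only keep or increase the value, which is why a Top 10 figure is an upper bound on exact-match accuracy.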
