Development of an AutoML Web Platform for Text Classification Automation

  • 송하윤 (AI Robot Research Division, Korea Institute of Robotics and Technology Convergence);
  • 강전성 (AI Robot Research Division, Korea Institute of Robotics and Technology Convergence);
  • 박범준 (AI Robot Research Division, Korea Institute of Robotics and Technology Convergence);
  • 김준영 (AI Robot Research Division, Korea Institute of Robotics and Technology Convergence);
  • 전광우 (AI Robot Research Division, Korea Institute of Robotics and Technology Convergence);
  • 윤준원 (AI Robot Research Division, Korea Institute of Robotics and Technology Convergence);
  • 정현준 (AI Robot Research Division, Korea Institute of Robotics and Technology Convergence)
  • Received: 2024.08.01
  • Accepted: 2024.08.26
  • Published: 2024.10.31

Abstract

The rapid advancement of artificial intelligence and machine learning technologies is driving innovation across various industries, with natural language processing (NLP) offering substantial opportunities for the analysis and processing of text data. Developing an effective text classification model requires several complex stages, including data exploration, preprocessing, feature extraction, model selection, hyperparameter optimization, and performance evaluation, all of which demand significant time and domain expertise. Automated machine learning (AutoML) aims to automate these processes, allowing practitioners without specialized knowledge to develop high-performance models efficiently. However, current AutoML frameworks are designed primarily for structured data, which presents challenges for unstructured text data because manual intervention is often required for preprocessing and feature extraction. To address these limitations, this study proposes a web-based AutoML platform that automates text preprocessing, word embedding, model training, and evaluation. The platform substantially improves the efficiency of text classification workflows: users upload text data, the system automatically generates an optimal machine learning model, and performance metrics are presented visually. Experimental results across multiple text classification datasets indicate that the proposed platform achieves high accuracy and precision, with particularly strong performance when a Stacked Ensemble model is used. This study highlights the potential for non-experts to analyze and leverage text data effectively through automated text classification, and outlines future work to further improve performance by integrating large language models (LLMs).
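
As a rough illustration of the workflow the abstract describes, the sketch below chains text feature extraction to an automated model search. It is a minimal sketch under stated assumptions, not the authors' implementation: the open-source h2o Python package (whose AutoML leaderboards typically include a Stacked Ensemble) stands in for the platform's backend, and the file name texts.csv and the column names text and label are illustrative.

    # Minimal sketch of an automated text-classification pipeline (an assumed
    # stand-in, not the paper's code): TF-IDF embedding + H2O AutoML search.
    import h2o
    import pandas as pd
    from h2o.automl import H2OAutoML
    from sklearn.feature_extraction.text import TfidfVectorizer

    h2o.init()

    # Hypothetical labeled dataset with a "text" column and a "label" column.
    df = pd.read_csv("texts.csv")

    # Preprocessing + feature extraction: represent each document as a
    # fixed-length TF-IDF vector (word embeddings would be a drop-in swap).
    vectorizer = TfidfVectorizer(max_features=1000, stop_words="english")
    X = vectorizer.fit_transform(df["text"]).toarray()
    features = pd.DataFrame(X, columns=[f"tfidf_{i}" for i in range(X.shape[1])])
    features["label"] = df["label"].values

    # Hand the feature table to H2O; marking the target as categorical makes
    # the AutoML run a classification task.
    frame = h2o.H2OFrame(features)
    frame["label"] = frame["label"].asfactor()
    train, test = frame.split_frame(ratios=[0.8], seed=42)

    # Model selection and hyperparameter tuning are delegated to AutoML;
    # the resulting leaderboard usually ranks a Stacked Ensemble on top.
    aml = H2OAutoML(max_models=10, seed=42)
    aml.train(y="label", training_frame=train)

    print(aml.leaderboard.head())               # ranked candidate models
    print(aml.leader.model_performance(test))   # accuracy, precision, etc.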

Acknowledgments

This research was supported by the Defense Industry Technology Center (UC200019D).
