Performance Improvement of Topic Modeling using BART based Document Summarization

  • Eun Su Kim (Division of AI Computer Science and Engineering, Kyonggi University) ;
  • Hyun Yoo (Contents Convergence Software Research Institute, Kyonggi University) ;
  • Kyungyong Chung (Division of AI Computer Science and Engineering, Kyonggi University)
  • Received : 2023.11.28
  • Accepted : 2024.03.12
  • Published : 2024.06.30

Abstract

The environment of academic research is continuously changing as the volume of information grows, raising the need for effective ways to analyze and organize large collections of documents. In this paper, we propose a method for improving the performance of topic modeling through BART (Bidirectional and Auto-Regressive Transformers) based document summarization. The proposed method uses a BART-based summarization model to extract the core content of each document and then performs topic modeling with the LDA (Latent Dirichlet Allocation) algorithm on the summaries. We validate this summarize-then-model approach through experiments. The results show that the BART-based model for summarizing article data captures the important information of the original articles, with F1-scores of 0.5819, 0.4384, and 0.5038 on the Rouge-1, Rouge-2, and Rouge-L evaluations, respectively. In addition, topic modeling on the summarized documents performs about 8.08% better than topic modeling on the full texts in a comparison using the Perplexity metric. This reduces the amount of data processed during topic modeling and improves its efficiency.
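
To make the pipeline above concrete, the following is a minimal, hypothetical sketch of the summarize-then-model approach. It assumes the Hugging Face checkpoint facebook/bart-large-cnn as an English stand-in for the Korean BART summarizer used in the paper, gensim's LdaModel for LDA, Google's rouge-score package for the Rouge evaluation, and two illustrative toy documents; none of these are the authors' actual models or data.

```python
# Hypothetical sketch of the summarize-then-topic-model pipeline described in
# the abstract. Assumptions: "facebook/bart-large-cnn" stands in for the
# paper's Korean BART summarizer, gensim's LdaModel for its LDA step, and the
# documents/reference summary below are toy examples.
from transformers import pipeline
from gensim import corpora
from gensim.models import LdaModel
from gensim.utils import simple_preprocess
from rouge_score import rouge_scorer
import numpy as np

documents = [
    "Topic modeling uncovers latent themes in large document collections, "
    "but long noisy texts can blur topic boundaries and slow inference.",
    "Abstractive summarization condenses a document to its core content, "
    "which can act as a noise-reduction step before statistical modeling.",
]

# Step 1: condense each document with a BART summarizer.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
summaries = [
    summarizer(doc, max_length=40, min_length=5, do_sample=False)[0]["summary_text"]
    for doc in documents
]

# Step 2 (evaluation side): ROUGE F1 against a reference summary, mirroring
# the Rouge-1/2/L scores reported above. The reference here is made up.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "Topic modeling finds latent themes but struggles on long noisy texts."
rouge = scorer.score(reference, summaries[0])
print({name: round(score.fmeasure, 4) for name, score in rouge.items()})

# Step 3: run LDA on the summaries instead of the full texts.
tokens = [simple_preprocess(s) for s in summaries]
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(t) for t in tokens]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=10, random_state=42)

# Step 4: perplexity for the comparison. gensim's log_perplexity returns a
# per-word log2 likelihood bound, so perplexity = 2 ** (-bound); lower is better.
perplexity = np.exp2(-lda.log_perplexity(corpus))
print(f"Perplexity on summarized corpus: {perplexity:.2f}")
```

In line with the abstract, the comparison would be run twice, once on the full documents and once on their summaries, and the two perplexity values compared; summarizing first shrinks the bag-of-words input to LDA, which is where the reduction in processed data comes from.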

Keywords

Acknowledgements

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (No. 2020R1A6A1A03040583).
