Improvement of topic modeling and case analysis through convergence of BERTopic and TextRank

  • 김근형 (Department of Management Information Systems, Jeju National University)
  • 강재정 (Department of Business Administration, Jeju National University)
  • Received : 2024.07.30
  • Accepted : 2024.08.31
  • Published : 2024.09.30

Abstract

Purpose: The purpose of this paper is to develop a method that improves topic representation by incorporating the TextRank technique into BERTopic-based topic modeling, and to provide an additional indicator for determining the optimal number of topics.

Design/methodology/approach: We propose a method that uses TextRank to extract the most important documents from those assigned to each topic of a topic model, and then uses the extraction results to generate topic representations and to compute a secondary diversity index. First, the TextRank algorithm is integrated into the BERTopic-based topic modeling process to assign a local secondary label to each topic; these secondary labels are derived through TextRank-based extractive summarization. Second, the accuracy of selecting the optimal number of topics is improved by calculating a secondary diversity index from the extractive summary of each topic. Third, labeling efficiency is improved by using ChatGPT to derive the label of each topic.

Findings: Case analysis and evaluation with the proposed method confirmed that topic representations based on the TextRank results produced more accurate topic labels, and that the secondary diversity index was a more effective indicator for determining the optimal number of topics.
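The pipeline described in the abstract can be sketched roughly as follows. The Python code below is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the bertopic, sentence-transformers, scikit-learn, and networkx packages; the embedding model name, the textrank_summary helper, and the word-overlap form of the "secondary diversity" index are illustrative choices rather than details taken from the paper.

    import numpy as np
    import networkx as nx
    from bertopic import BERTopic
    from sentence_transformers import SentenceTransformer
    from sklearn.metrics.pairwise import cosine_similarity

    def textrank_summary(docs, embedder, top_n=3):
        """Rank one topic's documents with PageRank over a cosine-similarity
        graph and return the top_n most central documents (extractive summary)."""
        emb = embedder.encode(docs)
        sim = cosine_similarity(emb)
        np.fill_diagonal(sim, 0.0)                 # drop self-similarity
        graph = nx.from_numpy_array(sim)           # weighted, undirected graph
        scores = nx.pagerank(graph, weight="weight")
        ranked = sorted(scores, key=scores.get, reverse=True)
        return [docs[i] for i in ranked[:top_n]]

    documents = ["..."]                            # placeholder: corpus of raw texts
    embedder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed embedding model
    topic_model = BERTopic(embedding_model=embedder)
    topics, _ = topic_model.fit_transform(documents)

    # TextRank-style extractive summary per topic (basis for secondary labels).
    summaries = {}
    for topic_id in set(topics):
        if topic_id == -1:                         # skip BERTopic's outlier topic
            continue
        topic_docs = [d for d, t in zip(documents, topics) if t == topic_id]
        summaries[topic_id] = textrank_summary(topic_docs, embedder)

    # One plausible reading of the "secondary diversity" index: the share of
    # unique words across per-topic summaries (higher = less overlap between topics).
    words_per_topic = [set(" ".join(s).lower().split()) for s in summaries.values()]
    all_words = [w for ws in words_per_topic for w in ws]
    secondary_diversity = len(set(all_words)) / max(len(all_words), 1)

The extractive summaries can then feed the ChatGPT step mentioned in the abstract, for example by prompting ChatGPT with the top-ranked documents of a topic to obtain a concise label for it; computing the secondary diversity index across several candidate topic counts gives the additional criterion for choosing the optimal number of topics.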

Acknowledgement

This research was supported by the Faculty Performance Support Program of Jeju National University in the 2024 academic year.
