Abbreviation Disambiguation using Topic Modeling

  • Received: 2022.11.24
  • Reviewed: 2023.02.27
  • Published: 2023.03.31

Abstract

Recently, many studies have used text analysis to examine trends or research trends. When the search keyword used to collect documents for text analysis is an abbreviation, the ambiguity inherent in abbreviations must be resolved. In many studies, documents are classified by hand, with researchers reading the collected data one by one to find the material relevant to their work. Most prior research on abbreviation disambiguation focuses on clarifying the meaning of individual words and relies on supervised learning. These existing methods are not well suited to classifying documents retrieved with an abbreviation as the search keyword in order to find the documents a study actually needs, and related research is also scarce. This paper proposes a method that semi-automatically classifies documents collected with an abbreviation by performing topic modeling with Non-negative Matrix Factorization (NMF), an unsupervised learning method, in the data preprocessing step. To verify the proposed method, papers were collected from an academic database with the abbreviation 'MSA' as the search keyword. Among the 1,401 collected papers, the proposed method identified 316 papers related to Micro Services Architecture. The document classification accuracy of the proposed method was measured at 92.36%. The proposed method is expected to reduce the time and cost that researchers spend on manual classification.
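A minimal sketch of the kind of preprocessing pipeline described above, assuming Python with scikit-learn's TfidfVectorizer and NMF; the toy corpus, the number of topics k, and the mapping from topics to the 'Micro Services Architecture' sense are illustrative assumptions, not the authors' actual data or configuration.

# Sketch: NMF topic modeling to semi-automatically separate documents retrieved
# with an ambiguous abbreviation (e.g., 'MSA'). Corpus, k, and the topic-to-sense
# mapping are illustrative assumptions, not the paper's actual setup.
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical abstracts retrieved with the search keyword 'MSA'.
documents = [
    "Microservices architecture decomposes an application into small services.",
    "Multiple sequence alignment compares three or more biological sequences.",
    "Service mesh and containers support a microservice architecture deployment.",
    "Progressive alignment heuristics speed up multiple sequence alignment.",
]

# TF-IDF weighting of the document-term matrix.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(documents)

# Factorize into k topics (k chosen by the analyst, e.g., via topic coherence).
k = 2
nmf = NMF(n_components=k, init="nndsvd", random_state=0)
doc_topic = nmf.fit_transform(tfidf)   # document-topic weights (W)
topic_term = nmf.components_           # topic-term weights (H)

# Show the top terms per topic so the analyst can label each topic with the
# abbreviation sense it represents.
terms = vectorizer.get_feature_names_out()
for t, weights in enumerate(topic_term):
    top = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"topic {t}: {', '.join(top)}")

# Assign each document to its dominant topic; documents whose dominant topic is
# labeled with the intended sense are kept for the study.
labels = doc_topic.argmax(axis=1)
print(labels)

In a pipeline of this kind, the analyst inspects the top terms of each topic, labels the topics that correspond to the intended sense of the abbreviation, and retains only the documents whose dominant topic carries that label; this manual labeling step is what makes the classification semi-automatic.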
