Abbreviation Disambiguation using Topic Modeling

  • Received: 2022.11.24
  • Reviewed: 2023.02.27
  • Published: 2023.03.31

Abstract

Recently, many studies have used text analysis to examine trends or research trends. When the search keyword used to collect documents for text analysis is an abbreviation, the ambiguity inherent in abbreviations must be resolved. In many studies, documents are classified by hand, with researchers reading the collected data one by one to find the material relevant to their work. Most prior research on abbreviation disambiguation focuses on clarifying the meaning of individual words and relies on supervised learning. These existing methods are not well suited to classifying documents retrieved with an abbreviation as the search keyword in order to find the documents a study actually needs, and related research is also scarce. This paper proposes a method that semi-automatically classifies documents collected with an abbreviation by performing topic modeling with Non-negative Matrix Factorization (NMF), an unsupervised learning method, in the data preprocessing step. To verify the proposed method, papers were collected from an academic database with the abbreviation 'MSA' as the search keyword. Among the 1,401 collected papers, the proposed method identified 316 papers related to Micro Services Architecture. The document classification accuracy of the proposed method was measured at 92.36%. The proposed method is expected to reduce the time and cost that researchers spend on manual classification.
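A minimal sketch of the kind of preprocessing pipeline described above, assuming Python with scikit-learn's TfidfVectorizer and NMF; the toy corpus, the number of topics k, and the mapping from topics to the 'Micro Services Architecture' sense are illustrative assumptions, not the authors' actual data or configuration.

# Sketch: NMF topic modeling to semi-automatically separate documents retrieved
# with an ambiguous abbreviation (e.g., 'MSA'). Corpus, k, and the topic-to-sense
# mapping are illustrative assumptions, not the paper's actual setup.
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical abstracts retrieved with the search keyword 'MSA'.
documents = [
    "Microservices architecture decomposes an application into small services.",
    "Multiple sequence alignment compares three or more biological sequences.",
    "Service mesh and containers support a microservice architecture deployment.",
    "Progressive alignment heuristics speed up multiple sequence alignment.",
]

# TF-IDF weighting of the document-term matrix.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(documents)

# Factorize into k topics (k chosen by the analyst, e.g., via topic coherence).
k = 2
nmf = NMF(n_components=k, init="nndsvd", random_state=0)
doc_topic = nmf.fit_transform(tfidf)   # document-topic weights (W)
topic_term = nmf.components_           # topic-term weights (H)

# Show the top terms per topic so the analyst can label each topic with the
# abbreviation sense it represents.
terms = vectorizer.get_feature_names_out()
for t, weights in enumerate(topic_term):
    top = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"topic {t}: {', '.join(top)}")

# Assign each document to its dominant topic; documents whose dominant topic is
# labeled with the intended sense are kept for the study.
labels = doc_topic.argmax(axis=1)
print(labels)

In a pipeline of this kind, the analyst inspects the top terms of each topic, labels the topics that correspond to the intended sense of the abbreviation, and retains only the documents whose dominant topic carries that label; this manual labeling step is what makes the classification semi-automatic.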
