A Study on the Classification of Research Topics Based on COVID-19 Academic Research Using Topic Modeling

토픽모델링을 활용한 COVID-19 학술 연구 기반 연구 주제 분류에 관한 연구

  • Received : 2021.12.30
  • Accepted : 2022.02.01
  • Published : 2022.03.31

Abstract

From January 2020 to October 2021, more than 500,000 academic studies related to COVID-19, the disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), were published. This rapid growth in the literature imposes time and technical constraints on healthcare professionals and policy makers who need to find important research quickly. In this study, we therefore propose a method for extracting useful information from the text of this extensive literature using the LDA and Word2vec algorithms. Papers related to a search keyword were extracted from the COVID-19 literature, and their detailed topics were identified. The data come from the CORD-19 data set on Kaggle, a free academic resource prepared by major research groups and the White House in response to the COVID-19 pandemic and updated weekly.

The research method has two main parts. First, data filtering and pre-processing were applied to the abstracts of 47,110 academic papers with full text, yielding 41,062 articles. Exploratory data analysis in Python was used to count COVID-19 publications by year and to identify the top 10 journals in which research is most active. LDA and Word2vec were then used to derive COVID-19 research topics, analyze related words, and measure their similarity. Second, papers containing 'vaccine' and 'treatment' were extracted from the topics derived from all papers, giving 4,555 papers related to 'vaccine' and 5,971 papers related to 'treatment'. For each set of extracted papers, detailed topics were analyzed using LDA and Word2vec, and the papers were clustered after PCA dimension reduction and visualized with the t-SNE algorithm to show groups of papers with similar themes.

A noteworthy result is that topics that did not appear among the topics derived from all COVID-19 papers did appear in the topic modeling results for each research keyword. For example, topic modeling of the papers related to 'vaccine' extracted a new topic, Topic 05 'neutralizing antibodies'. A neutralizing antibody is an antibody that protects cells from infection when a virus enters the body, and it is said to play an important role in the production of therapeutic agents and in vaccine development. Likewise, extracting topics from the papers related to 'treatment' revealed a new topic, Topic 05 'cytokine'. A cytokine storm occurs when the body's immune cells, instead of defending against the invading pathogen, attack normal cells. In short, hidden topics that could not be found in the corpus as a whole were uncovered by classifying the papers by keyword and running topic modeling on each group.

In this study, we proposed a method that extracts topics from a large body of literature using the LDA algorithm and extracts similar words using the Skip-gram variant of Word2vec, which predicts surrounding words from a central word. Combining the two models aims at better performance by capturing both the document-topic relationships from LDA and the word-level relationships learned by Word2vec. In addition, we presented a clustering approach that applies PCA dimension reduction followed by t-SNE to group documents with similar themes, organizing the documents into a structure that can be classified intuitively. At a time when researchers' efforts to overcome COVID-19 cannot keep pace with the rapid publication of related papers, we hope this method will save healthcare professionals and policy makers precious time and effort and help them gain new insights quickly. It is also expected to serve as basic data for researchers exploring new research directions.

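From January 2020 to October 2021, more than 500,000 academic studies related to COVID-19 (the disease caused by severe acute respiratory syndrome coronavirus 2) were published. As the number of COVID-19 papers has increased rapidly, healthcare professionals and policy makers face time and technical constraints in quickly finding important research. This study therefore presents a way to extract useful information from the text of a vast literature using the LDA and Word2vec algorithms. Papers related to a search keyword were extracted from the COVID-19 papers, and detailed topics were identified for them. The data come from the CORD-19 data set on Kaggle, a free academic resource prepared by major research groups and the White House to respond to the COVID-19 pandemic, which is updated weekly. The research method is divided into two main parts. First, LDA topic modeling and Word2vec related-word analysis are performed on the abstracts of 47,110 academic papers, and then 4,555 papers related to 'vaccine' and 5,791 papers related to 'treatment' are extracted from the derived topics. Second, for the extracted papers, LDA and PCA dimension reduction followed by the t-SNE technique were used to cluster papers with similar topics and visualize them as a scatter plot. By classifying the literature by keyword and performing topic modeling on each group, detailed topics were found that had remained hidden when all papers were analyzed together. The goal of this study is to present a way to classify the literature on specific information by entering keywords into a large body of literature. It also aims to propose a way to reduce the precious time and effort of healthcare professionals and policy makers and to obtain information quickly. We expect the results to serve as basic data that help discover COVID-19 topics in the abstracts of academic papers and explore new research directions on COVID-19.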
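The second stage, PCA dimension reduction followed by t-SNE to place papers with similar themes near each other, can be sketched in Python as follows. This is a hedged illustration only: scikit-learn, the TF-IDF features, the toy documents, the parameter values, and the helper name `embed_2d` are all assumptions, since the abstract does not specify the features or settings the authors used.

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

# Toy documents standing in for keyword-filtered abstracts (hypothetical data).
docs = [
    "neutralizing antibodies protect cells after vaccine dose",
    "vaccine induced neutralizing antibody response in adults",
    "booster vaccine dose and antibody titers over time",
    "cytokine storm in severe patients receiving treatment",
    "treatment with corticosteroids lowered cytokine levels",
    "antiviral treatment outcomes in hospitalized patients",
    "vaccine acceptance and public health communication",
    "clinical trial of antiviral treatment and recovery time",
]


def embed_2d(texts, n_pca=5, perplexity=3.0, seed=0):
    """Map texts to 2-D points: TF-IDF features, then PCA, then t-SNE."""
    # TF-IDF is an assumed feature choice; PCA needs a dense array here.
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(texts).toarray()
    # PCA first compresses the sparse, high-dimensional features.
    reduced = PCA(n_components=n_pca, random_state=seed).fit_transform(tfidf)
    # t-SNE then projects to 2-D; perplexity must be below the sample count.
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca",
                random_state=seed)
    return tsne.fit_transform(reduced)


coords = embed_2d(docs)
print(coords.shape)  # one (x, y) point per document
```

The resulting coordinates can be drawn as a scatter plot, with points colored by dominant topic or cluster label. Applying PCA before t-SNE is a common way to reduce noise and computation on large corpora; on a real data set the number of PCA components and the perplexity would be much larger than in this toy example.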
