Construction of Event Networks from Large News Data Using Text Mining Techniques

텍스트 마이닝 기법을 적용한 뉴스 데이터에서의 사건 네트워크 구축

  • Lee, Minchul (Graduate School of Library and Information Science, Yonsei University) ;
  • Kim, Hea-Jin (Institute of the Study for the Korean Modernity, Yonsei University)
  • 이민철 (연세대학교 문헌정보학과) ;
  • 김혜진 (연세대학교 근대한국학연구소)
  • Received : 2017.11.19
  • Accepted : 2018.03.16
  • Published : 2018.03.31

Abstract

News articles are the most suitable medium for following events at home and abroad. As advances in information and communication technology have produced many kinds of online news media, coverage of events in society has increased greatly. Automatically summarizing key events from massive amounts of news data therefore helps users survey many events at a glance. In addition, an event network built on the relevance between events can greatly help readers understand current events. In this study, we propose a method for extracting event networks from large news text data. To this end, we first collected Korean political and social articles from March 2016 to March 2017, then preprocessed them with NPMI and Word2Vec to keep only meaningful words and merge synonyms. Latent Dirichlet allocation (LDA) topic modeling was used to calculate the topic distribution by date, and events were detected by finding the peaks of that distribution. A total of 32 topics were extracted, and the point at which each event occurred was inferred from the point at which the corresponding topic distribution surged. As a result, a total of 85 events were detected, and these were filtered down to 16 final events using Gaussian smoothing. To construct the event network, we also calculated a relevance score between the detected events using the cosine coefficient between co-occurring events. Finally, we built the event network by setting each event as a vertex and the relevance score between events as the edge connecting the vertices.
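The peak-based event detection described above can be illustrated with a minimal sketch. The abstract does not give the kernel width, the peak criterion, or the threshold, so `gaussian_smooth`, `find_peaks`, `sigma`, `threshold`, and the toy `daily_share` series are all assumptions made for illustration, not the authors' implementation.

```python
import math

def gaussian_smooth(series, sigma=1.0):
    """Smooth a 1-D series with a truncated Gaussian kernel, renormalized at the edges."""
    radius = max(1, int(3 * sigma))
    smoothed = []
    for i in range(len(series)):
        total = 0.0
        weight_sum = 0.0
        for j in range(max(0, i - radius), min(len(series), i + radius + 1)):
            w = math.exp(-((i - j) ** 2) / (2 * sigma ** 2))
            total += w * series[j]
            weight_sum += w
        smoothed.append(total / weight_sum)
    return smoothed

def find_peaks(series, threshold):
    """Return indices of interior local maxima whose value exceeds the threshold."""
    return [
        i for i in range(1, len(series) - 1)
        if series[i] > threshold
        and series[i] > series[i - 1]
        and series[i] >= series[i + 1]
    ]

# Hypothetical daily share of one LDA topic (fraction of that day's articles).
daily_share = [0.02, 0.03, 0.02, 0.04, 0.20, 0.35,
               0.22, 0.05, 0.03, 0.02, 0.06, 0.03]

raw_peaks = find_peaks(daily_share, threshold=0.05)
smoothed_peaks = find_peaks(gaussian_smooth(daily_share, sigma=1.0), threshold=0.05)

print(raw_peaks)       # [5, 10]
print(smoothed_peaks)  # [5]
```

In this toy series, smoothing removes the minor spike at index 10 and keeps the sustained surge at index 5, which is the kind of filtering that could reduce a larger set of detected peaks to a smaller set of events.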
The event network constructed with our method helped us arrange the major political and social events that occurred in Korea over the past year in chronological order and, at the same time, identify which events are related to one another. Our approach differs from existing event detection methods in that LDA topic modeling makes it easy to analyze large amounts of data and to identify relevance between events that existing event detection had difficulty capturing. In text preprocessing, we applied various text mining techniques together with Word2vec to improve the accuracy of extracting proper nouns and compound nouns, which has been difficult in the analysis of Korean text. The event detection and network construction techniques in this study have the following advantages in practical application. First, LDA topic modeling, being unsupervised, can easily extract topics, topic words, and their distributions from a huge amount of data. Also, by using the date information of the collected news articles, the distribution of each topic can be expressed as a time series. Second, by calculating relevance scores from the co-occurrence of topics and constructing an event network, we can present the connections between events, which are difficult to grasp in existing event detection, in an explicit and summarized form. This is supported by the fact that the relevance-based event network proposed in this study was in fact laid out in order of occurrence time. The event network also makes it possible to identify which event was the starting point of a series of events. One limitation of this study is that LDA topic modeling yields different results depending on the initial parameters and the number of topics, and that the topic and event names in the analysis results must be assigned by the researcher's subjective judgment.
Also, since each topic is assumed to be exclusive and independent, the method does not take into account the relevance between topics. Subsequent studies need to calculate the relevance between events not covered in this study and between events that belong to the same topic.
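The relevance scoring and network construction mentioned in the abstract can be sketched as follows. This uses the plain cosine coefficient on sets of article IDs, which may differ from the exact score used in the study. The names `cosine_coefficient`, `build_event_network`, `min_score`, and the toy `event_docs` mapping are hypothetical.

```python
import math
from itertools import combinations

def cosine_coefficient(docs_a, docs_b):
    """Cosine coefficient between two sets of article IDs: |A & B| / sqrt(|A| * |B|)."""
    if not docs_a or not docs_b:
        return 0.0
    return len(docs_a & docs_b) / math.sqrt(len(docs_a) * len(docs_b))

def build_event_network(event_docs, min_score=0.1):
    """Return weighted edges {(event_a, event_b): score} for pairs scoring at least min_score."""
    edges = {}
    for a, b in combinations(sorted(event_docs), 2):
        score = cosine_coefficient(event_docs[a], event_docs[b])
        if score >= min_score:
            edges[(a, b)] = score
    return edges

# Hypothetical mapping from each detected event to the articles that mention it.
event_docs = {
    "event_A": {1, 2, 3, 4},
    "event_B": {3, 4, 5, 6},
    "event_C": {7, 8},
}

network = build_event_network(event_docs)
print(network)  # {('event_A', 'event_B'): 0.5}
```

Each key of `event_docs` plays the role of a vertex and each entry of `network` is a weighted edge. Here `event_C` shares no articles with the other events, so it stays isolated.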

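Traditionally, newspapers have been the most suitable medium for following events at home and abroad. With recent advances in information and communication technology, diverse online news media have emerged and coverage of everyday events has grown sharply. This gives readers faster and more convenient access to large amounts of information, but it also creates the problem of more information than they can consume. This study aims to propose a technique that extracts data from a massive collection of news articles, detects major events, and judges the relevance between them to build an event network, thereby giving readers explicit and summarized event information. To this end, we collected Korean political and social articles from March 2016 to March 2017 and, in preprocessing, used NPMI and Word2Vec to improve the accuracy of extracting proper nouns, compound nouns, and variant-form synonyms. We then ran LDA topic modeling, computed the topic distribution by date, and detected events by finding the peaks of that distribution. To build the event network, we measured the relevance between detected events by extending cosine similarity into a relevance score, on the assumption that two events are more closely related the more often they appear together in the same news article. Finally, we built the event network by setting each event as a vertex and the relevance score between events as the edge connecting the vertices. The resulting event network arranged the major political and social events that occurred in Korea over one year in chronological order and, at the same time, helped identify which events a given event is related to. It also made it possible to identify which event was the starting point of a series of events. The academic significance of this study lies in applying various text mining techniques and the recently prominent Word2vec technique in text preprocessing, which improved the accuracy of extracting proper nouns, compound nouns, and variant-form synonyms, a long-standing difficulty in Korean text analysis. The approach also differs from existing event detection methods in that LDA topic modeling makes large amounts of data easy to analyze, and in that relevance between events, which was hard to capture in existing event detection, can be identified through topic co-occurrence.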
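As an illustration of how NPMI can flag compound-noun candidates during preprocessing, the sketch below scores adjacent token pairs with normalized pointwise mutual information, NPMI(x, y) = ln(p(x, y) / (p(x) p(y))) / (-ln p(x, y)). The abstract does not say how NPMI was applied, so the bigram-level formulation, the function name `npmi_scores`, and the toy English tokens are assumptions.

```python
import math
from collections import Counter

def npmi_scores(sentences):
    """Score each adjacent token pair by NPMI, estimated from bigram counts."""
    bigrams = Counter()
    for tokens in sentences:
        bigrams.update(zip(tokens, tokens[1:]))
    total = sum(bigrams.values())

    # Marginals are taken over bigram positions so that NPMI stays in [-1, 1].
    left = Counter()
    right = Counter()
    for (x, y), count in bigrams.items():
        left[x] += count
        right[y] += count

    scores = {}
    for (x, y), count in bigrams.items():
        if count == total:
            # Only one bigram type exists; -ln p(x, y) would be zero.
            scores[(x, y)] = 1.0
            continue
        p_xy = count / total
        p_x = left[x] / total
        p_y = right[y] / total
        scores[(x, y)] = math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)
    return scores

# Hypothetical tokenized sentences.
sentences = [
    ["national", "assembly", "vote"],
    ["national", "assembly", "hearing"],
    ["vote", "delayed"],
]

scores = npmi_scores(sentences)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 3))
```

Pairs that always occur together, such as `("national", "assembly")` in this toy data, reach the maximum score of 1 and would be natural candidates for merging into a single compound token. Pairs that co-occur only some of the time score lower.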

Cited by

  1. 텍스트마이닝을 활용한 북한 지도자의 신년사 및 연설문 트렌드 연구 vol.26, pp.3, 2018, https://doi.org/10.21219/jitam.2019.26.3.043
  2. Word2Vec을 활용한 제품군별 시장규모 추정 방법에 관한 연구 vol.26, pp.1, 2020, https://doi.org/10.13088/jiis.2020.26.1.001
  3. 뉴스 데이터를 활용한 텍스트 감성분석에 따른 지역 산업생태계 위기 예측 - 광주 지역 자동차 산업을 중심으로 - vol.20, pp.8, 2018, https://doi.org/10.5392/jkca.2020.20.08.001
  4. 정치 PR 전략으로서의 SNS 메시지 : 21대 총선을 중심으로 vol.20, pp.9, 2018, https://doi.org/10.5392/jkca.2020.20.09.208
  5. 인과관계문형 기반 사회이슈 발생원인 도출 방법 연구 vol.19, pp.3, 2018, https://doi.org/10.14400/jdc.2021.19.3.167
  6. Word2Vec를 이용한 토픽모델링의 확장 및 분석사례 vol.30, pp.1, 2021, https://doi.org/10.5859/kais.2021.30.1.45