Investigating Dynamic Mutation Process of Issues Using Unstructured Text Analysis

비정형 텍스트 분석을 활용한 이슈의 동적 변이과정 고찰

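  • Myungsu Lim (Graduate School of Business IT, Kookmin University) ;
  • Namgyu Kim (School of Management Information Systems, College of Business Administration, Kookmin University)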
  • Received : 2015.11.25
  • Accepted : 2016.02.09
  • Published : 2016.03.31

Abstract

Owing to the extensive use of Web media and the development of the IT industry, a large amount of data has been generated, shared, and stored. Nowadays, various types of unstructured data such as images, sound, video, and text are distributed through Web media. Therefore, many attempts have been made in recent years to discover new value through an analysis of these unstructured data. Among these types of unstructured data, text is recognized as the most representative means for users to express and share their opinions on the Web. In this sense, demand for obtaining new insights through text analysis is steadily increasing. Accordingly, text mining is increasingly being used for different purposes in various fields. In particular, issue tracking is being widely studied not only in the academic world but also in industry because it can be used to extract various issues from text sources such as news and social network services (SNS) and to analyze the trends of these issues. Conventionally, issue tracking is used to identify major issues sustained over a long period of time through topic modeling and to analyze the detailed distribution of documents involved in each issue. However, because conventional issue tracking assumes that the content composing each issue does not change throughout the entire tracking period, it cannot represent the dynamic mutation process of detailed issues that can be created, merged, divided, and deleted between these periods. Moreover, because only keywords that appear consistently throughout the entire period can be derived as issue keywords, concrete issue keywords such as "nuclear test" and "separated families" may be concealed by more general issue keywords such as "North Korea" in an analysis over a long period of time. This implies that many meaningful but short-lived issues cannot be discovered by conventional issue tracking.
Note that detailed keywords are preferable to general keywords because the former can provide clues for actionable strategies. To overcome these limitations, we performed an independent analysis on the documents of each detailed period. We generated an issue flow diagram based on the similarity of each issue between two consecutive periods. The issue transition pattern among categories was analyzed by using the category information of each document. We applied the proposed methodology to a real case of 53,739 news articles and derived an issue flow diagram from them. We then propose the following application scenarios for the issue flow diagram presented in the experiment section. First, we can identify an issue that actively appears during a certain period and promptly disappears in the next period. Second, the preceding and following issues of a particular issue can be easily discovered from the issue flow diagram. This implies that our methodology can be used to discover the association between inter-period issues. Finally, an interesting pattern of one-way and two-way transitions was discovered by analyzing the transition patterns of issues through category analysis. Specifically, we discovered that a pair of mutually similar categories induces two-way transitions. In contrast, one-way transitions can be recognized as an indicator that issues in a certain category tend to be influenced by issues in another category. For practical application of the proposed methodology, high-quality word and stop word dictionaries need to be constructed. In addition, not only the number of documents but also additional meta-information such as the read counts, time of writing, and comments of documents should be analyzed. A rigorous performance evaluation or validation of the proposed methodology should be performed in future work.
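
The linking step behind an issue flow diagram can be illustrated with a minimal sketch. It assumes that each period's issues are already available as keyword-weight vectors (for example, from a topic model run independently on each period), that similarity is measured with cosine similarity, and that a link is drawn when the similarity reaches a threshold. The similarity measure, the threshold of 0.3, the function names, and the toy data are all assumptions made for illustration; the abstract does not specify them.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two sparse keyword-weight dicts."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def link_issues(periods, threshold=0.3):
    """Link similar issues in consecutive periods.

    periods: list of {issue_name: {keyword: weight}} dicts, one per period.
    Returns edges (period_index, earlier_issue, later_issue, similarity).
    The threshold is a hypothetical value, not one taken from the paper.
    """
    edges = []
    for t in range(len(periods) - 1):
        for name_a, vec_a in periods[t].items():
            for name_b, vec_b in periods[t + 1].items():
                sim = cosine(vec_a, vec_b)
                if sim >= threshold:
                    edges.append((t, name_a, name_b, round(sim, 3)))
    return edges

# Toy data (invented): one issue in period 0, two issues in period 1.
periods = [
    {"nuclear test": {"north korea": 1.0, "nuclear": 2.0, "test": 2.0}},
    {
        "sanctions": {"north korea": 1.0, "nuclear": 1.0, "sanctions": 2.0},
        "separated families": {"north korea": 1.0, "families": 2.0, "reunion": 2.0},
    },
]
flow_edges = link_issues(periods)
print(flow_edges)
```

In this toy example only "nuclear test" and "sanctions" are linked; "separated families" shares just the general keyword "north korea" with the earlier issue and falls below the threshold. An issue with no incoming edge can be read as newly created, and one with no outgoing edge as having disappeared in the next period.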

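As available text data resources have grown in recent years, so has the demand for creating new value through the analysis of large volumes of text. In particular, research is being actively conducted on issue tracking, which discovers various issues from text distributed through news, civil complaints, blogs, SNS, and similar channels and analyzes the trends of these issues. Conventional issue tracking uses topic modeling to identify major issues that have persisted over a long period and then analyzes how the number of documents composing each issue is distributed across detailed periods. However, because conventional issue tracking assumes that the content of each issue remains unchanged throughout the entire period, it cannot represent the dynamic mutation process in which various detailed issues influence one another and are created, merged, divided, and extinguished. Moreover, because only keywords that appear consistently throughout the entire period are derived as issue keywords, concrete issues such as nuclear tests and separated families, which are understood in very different contexts in an analysis of detailed periods, may be subsumed and concealed by the larger issue of North Korea in an analysis over a long period. To overcome these limitations, this study derives the major issues of each detailed period through an independent analysis of the documents of that period and then derives an issue flow diagram based on the similarity between issues. We also analyze issue transition patterns among categories by using the category information of each document. We conducted an experiment applying the proposed methodology to a total of 53,739 newspaper articles and confirmed that it not only reveals the period-by-period composition of the major issues found by conventional issue tracking but also identifies the preceding and following issues of a particular issue. In addition, the inter-category analysis revealed interesting patterns of one-way and two-way transitions.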
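
The distinction between one-way and two-way category transitions can be sketched as follows. This is an illustrative sketch, not the authors' procedure: it assumes that each issue has been assigned a single dominant category (derived, for instance, from the categories of its documents) and that links between issues of consecutive periods are already known. The issue names, categories, and function names are hypothetical.

```python
from collections import Counter

def category_transitions(edges, category_of):
    """Count cross-category transitions over issue links (earlier_issue, later_issue)."""
    counts = Counter()
    for src, dst in edges:
        a, b = category_of[src], category_of[dst]
        if a != b:  # flows that stay within one category are ignored
            counts[(a, b)] += 1
    return counts

def classify_pairs(counts):
    """Split observed category pairs into two-way and one-way transitions."""
    two_way, one_way = set(), set()
    for a, b in counts:
        if (b, a) in counts:
            two_way.add(frozenset((a, b)))  # transitions seen in both directions
        else:
            one_way.add((a, b))  # transition seen only from a to b
    return two_way, one_way

# Toy data (invented): dominant category per issue, and links between issues.
category_of = {
    "t1_summit": "politics",
    "t1_oil": "economy",
    "t2_welfare": "society",
    "t3_election": "politics",
}
issue_links = [
    ("t1_summit", "t2_welfare"),    # politics -> society
    ("t1_oil", "t2_welfare"),       # economy  -> society
    ("t2_welfare", "t3_election"),  # society  -> politics
]
counts = category_transitions(issue_links, category_of)
two_way, one_way = classify_pairs(counts)
print(two_way, one_way)
```

With this toy input, politics and society form a two-way pair, while economy to society is a one-way transition. In practice the raw counts would likely need a minimum-support cutoff before a pair is labeled, which this sketch omits.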
