Twitter Issue Tracking System by Topic Modeling Techniques

토픽 모델링을 이용한 트위터 이슈 트래킹 시스템

  • Bae, Jung-Hwan (Graduate School, Dept. of Library and Information Science, Yonsei University)
  • Han, Nam-Gi (Graduate School, Dept. of Library and Information Science, Yonsei University)
  • Song, Min (Dept. of Library and Information Science, Yonsei University)
  • Received : 2014.06.15
  • Accepted : 2014.06.21
  • Published : 2014.06.30

Abstract

People now create a tremendous amount of data on Social Network Services (SNS). In particular, the integration of SNS into mobile devices has led to data generation on a massive scale, with a considerable influence on society. This phenomenon has no precedent in history, and we now live in the Age of Big Data. SNS data qualifies as Big Data because it satisfies the three defining conditions: the amount of data (volume), the speed of data input and output (velocity), and the variety of data types (variety). Because SNS Big Data covers the whole of society, the trend of an issue discovered in it can serve as an important new source of value. In this study, we design and build a Twitter Issue Tracking System (TITS) to meet the need for analyzing SNS Big Data. TITS extracts issues from Twitter texts and visualizes them on the web. The proposed system provides four functions: (1) it provides the set of topic keywords corresponding to each daily ranking; (2) it visualizes the daily time-series graph of a topic over a month; (3) it presents the importance of a topic through a treemap based on a scoring system and frequency; and (4) it visualizes the daily time-series graph of a keyword found through keyword search. The present study analyzes the Big Data generated by SNS in real time. SNS Big Data analysis requires various natural language processing techniques, including stop-word removal and noun extraction, to process unrefined, unstructured data. Such analysis also requires recent Big Data technology that can rapidly process large amounts of real-time data, such as the Hadoop distributed system or NoSQL databases, which are an alternative to relational databases. We built TITS on Hadoop to optimize the processing of Big Data because Hadoop is designed to scale from a single node to thousands of machines.
Furthermore, we use MongoDB, an open-source, document-oriented NoSQL database that provides high performance, high availability, and automatic scaling. Unlike existing relational databases, MongoDB has no fixed schema or tables, and its primary goals are data accessibility and data processing performance. In the Age of Big Data, visualization is attractive to the Big Data community because it helps analysts examine data easily and clearly. Therefore, TITS uses the d3.js library as its visualization tool. This library is designed for creating Data-Driven Documents that bind the document object model (DOM) to arbitrary data; interaction with the data is easy, and the library is useful for rendering real-time data streams with smooth animation. In addition, TITS uses Bootstrap, a set of pre-configured style sheets and JavaScript plug-ins, to build the web system. The TITS graphical user interface (GUI) is designed using these libraries and makes it possible to detect issues on Twitter in an easy and intuitive manner. We demonstrate the superiority of our issue detection techniques by matching detected issues with corresponding online news articles. The contributions of the present study are threefold. First, we suggest an alternative approach to real-time Big Data analysis, which has become an extremely important issue. Second, we apply a topic modeling technique that is used in various research areas, including Library and Information Science (LIS), and thereby confirm its utility for storytelling and time-series analysis. Third, we develop a web-based system and make it available for the real-time discovery of topics. The experiments used nearly 150 million tweets from Korea during March 2013.

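We are now generating enormous amounts of data on social network services (SNS). In particular, the combination of mobile devices and SNS is producing data at a scale incomparable to the past and is having a large impact on society. If the issues that people discuss most can be identified within this vast body of SNS data, that information can serve as an important source of new value across society. To meet this demand for SNS Big Data analysis, this study designed and built the Twitter Issue Tracking System (TITS), which uses Twitter data to extract the issues that arose on Twitter and visualizes them on the web. TITS provides four functions: (1) a set of topic keywords by daily ranking; (2) a daily time-series graph of each topic over one month; (3) a treemap showing the importance of each topic according to its score and frequency; and (4) a daily time-series graph over one month for a keyword found through keyword search. First, this study analyzed Big Data generated in real time on SNS using the open-source tools Hadoop and MongoDB, which offers an important methodology now that real-time processing of Big Data is becoming increasingly important. Second, by applying topic modeling, a technique used in Library and Information Science as well as in various other research areas, to real Twitter data, we confirmed its usefulness for storytelling and time-series analysis. Third, based on the experiments, we implemented a practically usable system through visualization and a web system. The significance of this work lies in presenting a practical method for mining the social trends generated on social media and providing meaningful information through data analysis. The experiments use about 150 million Korean-language tweets from March 2013 in JSON (JavaScript Object Notation) file format.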
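
The daily keyword time series described in the abstract (functions 2 and 4) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the stop-word list, the function names, and the toy tweets are all hypothetical, and whitespace tokenization stands in for the noun extraction that Korean text would actually require. The sketch only shows the general idea of counting keywords per day after stop-word removal.

```python
from collections import Counter, defaultdict

# Hypothetical stop-word list; the paper does not publish the one it used.
STOP_WORDS = {"the", "a", "is", "of", "and", "to", "in", "on"}


def tokenize(text):
    """Lowercase a tweet, split on whitespace, and drop stop words."""
    words = (w.strip(".,!?#@\"'") for w in text.lower().split())
    return [w for w in words if w and w not in STOP_WORDS]


def daily_keyword_counts(tweets):
    """Map each day to a Counter of keyword frequencies.

    `tweets` is an iterable of (date_string, text) pairs.
    """
    counts = defaultdict(Counter)
    for day, text in tweets:
        counts[day].update(tokenize(text))
    return counts


def keyword_series(counts, keyword):
    """Return the (day, frequency) series for one keyword, sorted by day."""
    return [(day, counts[day][keyword]) for day in sorted(counts)]


def top_keywords(counts, day, n=3):
    """Return the n most frequent keywords for a given day."""
    return [word for word, _ in counts[day].most_common(n)]


# Toy input, invented for illustration only.
sample = [
    ("2013-03-01", "The election debate is on TV"),
    ("2013-03-01", "Election results and the debate"),
    ("2013-03-02", "Spring weather in Seoul"),
    ("2013-03-02", "Election news"),
]
counts = daily_keyword_counts(sample)
series = keyword_series(counts, "election")
```

In a system at the scale the abstract describes, the per-day counting would presumably run as a distributed job and the results would be stored in a database for the web front end to query; the in-memory dictionaries here are only a stand-in for that.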

