Classification of Public Perceptions toward Smog Risks on Twitter Using Topic Modeling

A Study on Classifying Public Perceptions of Smog Risks on Twitter Using Topic Modeling

  • Kim, Yun-Ki (Department of Land Management, Cheongju University)
  • Received : 2017.04.13
  • Accepted : 2017.06.20
  • Published : 2017.06.30

Abstract

The main purpose of this study was to detect and classify public perceptions of smog disasters on Twitter using topic modeling. To identify gaps in the literature, the author reviewed prior work on public opinion about smog disasters and on topic modeling. The review indicated substantial gaps, and the author formulated five research questions to address them. To answer these questions, the study carried out data extraction, word cloud analysis of the cleaned data, construction of a network of terms, correlation analysis, hierarchical cluster analysis, topic modeling with latent Dirichlet allocation (LDA), and stream graph visualization. The results revealed large differences between New York and London in the most frequent terms, the shape of the term networks, the types of correlation, and the patterns of change in smog-related topics. The author therefore found affirmative answers to four of the five research questions and a partially affirmative answer to Research Question 4. Finally, based on these results, the author suggested policy implications and recommendations for future research.

The main purpose of this study is to measure and classify public perceptions of smog risks on Twitter using topic modeling. To identify research gaps, the study reviewed prior research on smog risks and topic modeling. The author confirmed that substantial research gaps exist in the existing literature and set five research questions to fill them. To answer the research questions, the study extracted 10,000 Twitter data items and conducted word cloud analysis, correlation analysis, topic modeling with LDA, stream graph analysis, and hierarchical cluster analysis. The analysis confirmed large differences between New York and London in the most frequent terms, the shape of the term network, the types of correlation, and the patterns of change in smog-related topics. The author thus obtained affirmative answers to four of the five research questions and, on that basis, presented several policy implications and suggestions for future research.
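
Two of the steps named in the abstract, term-frequency counting (the input to a word cloud) and topic modeling with LDA, can be illustrated with a minimal sketch. This is not the author's implementation, which is not shown on this page. It assumes a collapsed Gibbs sampler written in plain Python, and the toy tweets, the function names (`tokenize`, `term_frequencies`, `lda_gibbs`), and the hyperparameter values are all hypothetical choices made only for illustration.

```python
import random
from collections import Counter

# Hypothetical toy tweets; the study's actual Twitter data are not reproduced here.
TWEETS = [
    "smog alert air quality unhealthy today",
    "air pollution smog mask health",
    "traffic cars exhaust smog city",
    "city traffic jam cars commute",
    "health asthma air quality mask",
    "commute traffic exhaust cars smog",
]


def tokenize(text):
    """Lowercase and split on whitespace (real cleaning would do much more)."""
    return text.lower().split()


def term_frequencies(docs):
    """Count each term across all documents, e.g. to size words in a word cloud."""
    return Counter(tok for doc in docs for tok in doc)


def lda_gibbs(docs, n_topics=2, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Fit LDA by collapsed Gibbs sampling.

    Returns (vocab, topic_word, doc_topic), where each row of topic_word and
    doc_topic is a probability distribution. Hyperparameters are assumptions.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    n_vocab = len(vocab)

    doc_topic_counts = [[0] * n_topics for _ in docs]
    topic_word_counts = [[0] * n_vocab for _ in range(n_topics)]
    topic_totals = [0] * n_topics

    # Randomly assign an initial topic to every token.
    assignments = []
    for d, doc in enumerate(docs):
        doc_assignments = []
        for w in doc:
            k = rng.randrange(n_topics)
            doc_assignments.append(k)
            doc_topic_counts[d][k] += 1
            topic_word_counts[k][index[w]] += 1
            topic_totals[k] += 1
        assignments.append(doc_assignments)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wi = index[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic_counts[d][k] -= 1
                topic_word_counts[k][wi] -= 1
                topic_totals[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic_counts[d][t] + alpha)
                    * (topic_word_counts[t][wi] + beta)
                    / (topic_totals[t] + n_vocab * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic_counts[d][k] += 1
                topic_word_counts[k][wi] += 1
                topic_totals[k] += 1

    # Smoothed estimates of the topic-word and document-topic distributions.
    topic_word = [
        [(topic_word_counts[k][w] + beta) / (topic_totals[k] + n_vocab * beta)
         for w in range(n_vocab)]
        for k in range(n_topics)
    ]
    doc_topic = [
        [(doc_topic_counts[d][k] + alpha) / (len(doc) + n_topics * alpha)
         for k in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    return vocab, topic_word, doc_topic


docs = [tokenize(t) for t in TWEETS]
print(term_frequencies(docs).most_common(3))

vocab, topic_word, doc_topic = lda_gibbs(docs)
for k, row in enumerate(topic_word):
    top = sorted(range(len(vocab)), key=lambda w: row[w], reverse=True)[:4]
    print(f"topic {k}:", [vocab[w] for w in top])
```

Running the sampler separately on the tweets collected for each city and comparing the top terms per topic would be one simple way to contrast topics between two locations. A real analysis would also need stop-word removal, a far larger corpus, and a principled choice of the number of topics.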
