Building a Korean Sentiment Lexicon Using Collective Intelligence

집단지성을 이용한 한글 감성어 사전 구축

  • An, Jungkook (Graduate School of Information, Yonsei University)
  • Kim, Hee-Woong (Graduate School of Information, Yonsei University)
  • Received : 2015.05.20
  • Accepted : 2015.06.13
  • Published : 2015.06.30

Abstract

The recent emergence of big data and social media has brought about a "big bang" of data. Social networking services are widely used by people around the world, and they have become a major communication tool for all ages. Over the last decade, as online social networking sites have become increasingly popular, companies have come to focus on advanced social media analysis for their marketing strategies. Beyond social media analysis, companies are particularly concerned about the propagation of negative opinions on social networking sites such as Facebook and Twitter, as well as on e-commerce sites. Online word of mouth (WOM), such as product ratings, reviews, and recommendations, is highly influential, and negative opinions have a significant impact on product sales. This trend has drawn researchers' attention to natural language processing techniques such as sentiment analysis. Sentiment analysis, also referred to as opinion mining, is the process of identifying the polarity of subjective information, and it has been applied in various research and practical fields. However, applying natural language processing to Korean (Hangul) is difficult because Korean is an agglutinative language with rich morphology. As a result, Korean natural language processing resources such as sentiment lexicons are scarce, which significantly limits researchers and practitioners who are considering sentiment analysis.
Our study builds a Korean sentiment lexicon using collective intelligence and provides an API (Application Programming Interface) service that opens and shares the sentiment lexicon data with the public (www.openhangul.com). For pre-processing, we created a Korean lexicon database of more than 517,178 words and classified them into sentiment and non-sentiment words. To classify them, we first identified stop words, which are likely to degrade sentiment analysis, and excluded them from sentiment scoring. In general, sentiment words are nouns, adjectives, verbs, and adverbs, since they can express positive, neutral, or negative sentiment. Non-sentiment words, on the other hand, are interjections, determiners, numerals, postpositions, and the like, since they generally express no sentiment. To build a reliable sentiment lexicon, we adopted the concept of collective intelligence as a model for crowdsourcing. In addition, we applied the concept of folksonomy to the classification process to support collective intelligence. To compensate for an inherent weakness of folksonomy, we adopted majority rule by building a voting system. Participants, acting as voters, were offered three options (positive, negative, and neutral), and the voting was conducted on one of the largest social networking sites for college students in Korea. College students in Korea have cast more than 35,000 votes, and we keep the voting system open by maintaining the project as an ongoing study. Moreover, any change in a word's sentiment score is an important observation, because it lets us track how the Korean language changes over time. Lastly, our study offers a RESTful, JSON-based API service through a web platform to make the lexicon easier to use for researchers, companies, and developers.
Our study makes contributions to both research and practice. In terms of research, our Korean sentiment lexicon serves as a resource for Korean natural language processing. In terms of practice, practitioners such as managers and marketers can implement sentiment analysis effectively by using the Korean sentiment lexicon we built. Moreover, our study sheds new light on the value of folksonomy by combining it with collective intelligence, and we expect it to offer a new direction and a new starting point for the development of Korean natural language processing.
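
The majority-rule voting described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how votes are aggregated, so everything here is an assumption made for illustration: the function names `majority_label` and `sentiment_score`, the treatment of ties as undecided, and the score formula (positive share minus negative share).

```python
from collections import Counter

# The three voting options named in the abstract.
LABELS = ("positive", "neutral", "negative")


def majority_label(votes):
    """Return the label with the most votes, or None if there is no unique winner.

    Treating a tie as "undecided" is an assumption of this sketch.
    """
    counts = Counter(v for v in votes if v in LABELS)
    if not counts:
        return None
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie between the top two labels
    return ranked[0][0]


def sentiment_score(votes):
    """Hypothetical score in [-1, 1]: (positive - negative) / total valid votes."""
    counts = Counter(v for v in votes if v in LABELS)
    total = sum(counts.values())
    if total == 0:
        return None
    return (counts["positive"] - counts["negative"]) / total


if __name__ == "__main__":
    votes = ["positive", "positive", "negative", "neutral"]
    print(majority_label(votes))   # positive
    print(sentiment_score(votes))  # 0.25
```

Because the score is recomputed from the accumulated votes, re-running it as new votes arrive is one simple way to observe the changes in word sentiment over time that the abstract mentions.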

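As the use and analysis of big data have grown in importance across many fields, interest has risen in sentiment analysis based on natural language processing of unstructured data such as news articles and comments. However, unlike English, Korean is an agglutinative language that is difficult to process, so its use in informatization and information systems remains limited. This study therefore built a sentiment lexicon usable for sentiment analysis through collective intelligence, and opened an API service platform so that anyone can use it in research and practice (www.openhangul.com). To harness collective intelligence, we had college students vote on whether each word is positive, neutral, or negative, on the largest social networking site for college students in Korea. To make collective intelligence more efficient, we applied folksonomy's concept of "classification by the people", in which sentiment is "classified" rather than "defined". From a total of more than 517,178 Korean dictionary words, we excluded stop-word forms and prioritized nouns, adjectives, verbs, and adverbs, which can express sentiment; more than 35,000 votes on words have been cast to date. The lexicon is designed so that its reliability increases as participants accumulate, and it has the advantage of sensitively reflecting changes over time in the sentiment that people perceive in words. We therefore plan to continue the voting and, beyond the sentiment lexicon, base-form extraction, and category extraction currently offered, to provide additional APIs applicable to a variety of natural language processing tasks. Whereas prior studies have been limited to proposing ways to perform sentiment analysis or to build and use sentiment lexicons, this study is distinctive in that it actually uses collective intelligence to build, open, and share a resource usable in research and practice. Furthermore, we expect that this new attempt to build a Korean sentiment lexicon by combining the characteristics of collective intelligence and folksonomy will encourage interdisciplinary research and practical participation from various fields in the future development of Korean natural language processing, and will suggest new directions and implications for open collaboration.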
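
The pre-processing step (excluding stop words, then keeping nouns, adjectives, verbs, and adverbs as candidates for voting) can be sketched as follows. This is only an illustration under stated assumptions: the part-of-speech tag strings, the `(word, pos)` input format, and the function name `select_candidates` are hypothetical and do not come from the study's actual pipeline.

```python
# Parts of speech the abstract treats as able to express sentiment.
# The tag strings themselves are hypothetical placeholders.
SENTIMENT_POS = {"noun", "adjective", "verb", "adverb"}


def select_candidates(entries, stop_words):
    """Keep dictionary entries that are candidates for sentiment voting.

    entries: iterable of (word, pos) pairs (assumed input format).
    stop_words: set of words to exclude before scoring.
    Entries with any other part of speech (interjection, determiner,
    numeral, postposition, ...) are dropped as non-sentiment words.
    Input order is preserved.
    """
    return [
        (word, pos)
        for word, pos in entries
        if word not in stop_words and pos in SENTIMENT_POS
    ]


if __name__ == "__main__":
    entries = [
        ("좋다", "adjective"),
        ("은", "postposition"),
        ("것", "noun"),
        ("사랑", "noun"),
    ]
    # Prints [('좋다', 'adjective'), ('사랑', 'noun')]
    print(select_candidates(entries, stop_words={"것"}))
```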
