Sentiment analysis on movie review through building modified sentiment dictionary by movie genre

Sentiment analysis of movie reviews through building domain-customized sentiment dictionaries

  • Lee, Sang Hoon (Dept. of Mathematics, College of Natural Sciences, Hanyang University) ;
  • Cui, Jing (Dept. of Business Administration, Graduate School, Hanyang University) ;
  • Kim, Jong Woo (School of Business, Hanyang University)
  • Received : 2016.03.11
  • Accepted : 2016.04.11
  • Published : 2016.06.30

Abstract

With the growth of internet data and the rapid development of internet technology, "big data" analysis is actively used to analyze enormous volumes of data for various purposes. In recent years in particular, many studies have applied text mining techniques to overcome the limitations of existing structured data analysis. Sentiment analysis, one branch of text mining, is actively studied as a way to score opinions based on the distribution of word polarity in documents. Sentiment analysis usually relies on a sentiment dictionary that records the positivity and negativity of vocabulary items. As part of this line of work, this study constructs sentiment dictionaries customized to a specific data domain. Applying a common sentiment dictionary without considering the characteristics of the data domain cannot capture contextual expressions used only in that domain, so a dictionary customized to the domain can be expected to improve the performance of sentiment analysis. This study therefore proposes a way to construct customized dictionaries that reflect the characteristics of the data domain. Specifically, movie review data are divided by genre, a genre-customized dictionary is constructed for each genre, and the performance of the customized dictionaries in sentiment analysis is compared with that of a common sentiment dictionary. IMDb data are chosen as the subject of analysis, and movie reviews are categorized by genre. Six IMDb genres are selected: 'action', 'animation', 'comedy', 'drama', 'horror', and 'sci-fi'. For each genre, the five highest-ranking and five lowest-ranking movies are selected as the training data set, and about two years of movie data, from September 2012 to June 2014, are collected as the test data set.
Using the SO-PMI (Semantic Orientation from Point-wise Mutual Information) technique, we build a customized sentiment dictionary for each genre and compare prediction accuracy on review ratings. The analysis shows that prediction with the customized dictionaries improves accuracy. The overall improvement is 2.82% and is statistically significant. In particular, the customized dictionary for 'sci-fi' yields the largest accuracy improvement among the six genres. Although this study shows the usefulness of customized dictionaries in sentiment analysis, further studies are required to generalize the results. Here we consider only adjectives as additional terms in the customized sentiment dictionaries; other parts of speech, such as verbs and adverbs, could be added to improve sentiment analysis performance. The customized sentiment dictionary approach also needs to be applied to other domains, such as product reviews.
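
The abstract does not give the details of the SO-PMI computation, so the following is only a minimal sketch of the general idea: a word's score is the sum of its PMI with positive seed words minus the sum of its PMI with negative seed words. The function name `so_pmi`, the seed words, the document-level co-occurrence counting, and the smoothing constant are all assumptions made for illustration, not the authors' settings. The sketch also scores every token, whereas the study restricts candidate terms to adjectives.

```python
import math
from collections import Counter


def so_pmi(docs, pos_seeds, neg_seeds, smoothing=0.01):
    """Return {word: SO-PMI score} for tokenized documents.

    docs is a list of token lists. Co-occurrence is counted at the
    document level (an assumption; a word window is another option).
    """
    n_docs = len(docs)
    seeds = set(pos_seeds) | set(neg_seeds)
    df = Counter()  # number of documents containing each word
    co = Counter()  # number of documents containing (word, seed) together

    for tokens in docs:
        vocab = set(tokens)
        df.update(vocab)
        for seed in vocab & seeds:
            for word in vocab - {seed}:
                co[(word, seed)] += 1

    def pmi(word, seed):
        # Smoothing keeps the logarithm defined when the pair never co-occurs.
        joint = (co[(word, seed)] + smoothing) / n_docs
        return math.log2(joint / ((df[word] / n_docs) * (df[seed] / n_docs)))

    scores = {}
    for word in df:
        if word in seeds:
            continue
        # Seeds that never appear in the corpus are skipped.
        pos = sum(pmi(word, s) for s in pos_seeds if df[s])
        neg = sum(pmi(word, s) for s in neg_seeds if df[s])
        scores[word] = pos - neg
    return scores


# Toy corpus invented for illustration only.
toy_docs = [
    ["gripping", "good"],
    ["gripping", "excellent"],
    ["dull", "bad"],
    ["dull", "poor"],
]
toy_scores = so_pmi(toy_docs, ["good", "excellent"], ["bad", "poor"])
```

In this toy corpus, `gripping` receives a positive score and `dull` a negative one. In a genre-specific setting, the same computation would be run separately on each genre's training reviews, so that a word can receive different polarities in different genres.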

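As data on the internet grows rapidly, big data analysis, which puts enormous volumes of data to appropriate use, is being actively pursued. Recently, many studies have examined text mining, a branch of unstructured data analysis, as a way to complement the limitations of existing structured data analysis; research on sentiment analysis, which determines and classifies the positivity or negativity of sentences based on text, is particularly active. As an extension of this research, this study attempts to construct the sentiment dictionary used in sentiment analysis by adapting it to the characteristics of the data. When an existing general-purpose sentiment dictionary that does not consider the characteristics of the data's domain is used, the accuracy of sentiment analysis can drop because the dictionary fails to reflect the words or emotional expressions used in that domain. A domain-customized sentiment dictionary can therefore be expected to reflect the characteristics of the data domain accurately and raise the accuracy of the analysis. This study selects movie review data as the subject of analysis and collects and analyzes about two years of movie review data from IMDb, a representative movie information site. Before the analysis, considering that the meanings of words differ by movie genre, movies were classified into six genres: 'action', 'animation', 'comedy', 'drama', 'horror', and 'sci-fi'. SO-PMI (Semantic Orientation from Point-wise Mutual Information) was used as the core technique for building the customized sentiment dictionaries, and the study was limited to adjectives, whose polarity is clearly distinguishable. The prediction accuracy of sentiment analysis using the customized dictionaries differed by genre. In the five genres other than 'animation', the customized sentiment dictionaries showed a statistically significant improvement in prediction accuracy over the existing general-purpose sentiment dictionary. This study confirms that building a customized dictionary suited to the characteristics of the data domain improves the predictive performance of sentiment analysis. Future work could explore adding vocabulary from other parts of speech, such as verbs and adverbs, to the sentiment dictionary to raise prediction accuracy.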
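
The comparison between a general-purpose dictionary and a genre-customized one can be pictured with a small sketch. Everything here is hypothetical: the helper names `predict_polarity` and `accuracy`, the rule that a review is positive when its summed word polarity is above zero, the lexicon entries, and the four toy reviews are invented for illustration and are not the study's data or results. The customized lexicon is formed by adding genre-specific adjectives to the general one, which is one reading of the abstract rather than the confirmed procedure.

```python
def predict_polarity(tokens, lexicon):
    """Label a review 'pos' if its summed word polarity is above zero."""
    score = sum(lexicon.get(token, 0.0) for token in tokens)
    return "pos" if score > 0 else "neg"  # ties fall to 'neg' (assumption)


def accuracy(reviews, lexicon):
    """Fraction of (tokens, label) pairs whose label is predicted correctly."""
    correct = sum(
        predict_polarity(tokens, lexicon) == label for tokens, label in reviews
    )
    return correct / len(reviews)


# Hypothetical general-purpose lexicon and genre-specific additions.
general = {"good": 1.0, "bad": -1.0}
horror_terms = {"creepy": 1.0, "tame": -1.0}

# Genre terms are only added; entries in the general lexicon keep their values.
horror_lexicon = {**horror_terms, **general}

# Toy labeled reviews invented for illustration only.
reviews = [
    (["creepy", "atmosphere"], "pos"),
    (["tame", "plot"], "neg"),
    (["good", "creepy"], "pos"),
    (["bad", "tame"], "neg"),
]

general_accuracy = accuracy(reviews, general)
horror_accuracy = accuracy(reviews, horror_lexicon)
```

On this toy data the general lexicon misses the review that depends only on the genre-specific word `creepy`, while the extended lexicon labels all four reviews correctly. Real gains would depend on the corpus, the seed words, and the thresholds used to admit terms into the dictionary.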
