Sentiment Classification of Movie Reviews using Levenshtein Distance

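  • 안광모 (Department of Computer Engineering, Chungbuk National University)
  • 김윤석 (Department of Computer Engineering, Chungbuk National University)
  • 김영훈 (Mobile School, Chungkang College of Cultural Industries)
  • 서영훈 (Department of Computer Engineering, Chungbuk National University)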
  • Received : 2013.12.04
  • Accepted : 2013.12.27
  • Published : 2013.12.31


In this paper, we propose a sentiment classification method that uses Levenshtein distance. We apply Levenshtein distance to sentiment features to generate a bag-of-words (BOW) representation, which serves as the training features. The learning models are support vector machines (SVMs) and Naive Bayes (NB). As the data set, we collected 2,385 movie reviews from an online movie community (the Daum movie service). From the collected reviews we manually extracted sentiment words, selecting 778 in total. In the experiments, we trained the classifiers on the BOW generated by applying Levenshtein distance to the sentiment words and evaluated them with 10-fold cross-validation. The highest accuracy, 85.46%, was obtained with the Multinomial Naive Bayes classifier at a Levenshtein distance of 3. The experiments indicate that the classification performance of the proposed method is only slightly affected by spelling errors in the documents.
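The abstract does not spell out how Levenshtein distance is applied when building the BOW, so the following is only a minimal sketch under stated assumptions: each review token is credited to its nearest sentiment word when the edit distance is within a threshold, and distance is measured per character. The function names, the toy English lexicon, and the sample review are hypothetical illustrations, not the authors' implementation or data.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def fuzzy_bow(tokens, lexicon, max_distance):
    """Return one count per lexicon word, matching tokens approximately.

    Assumption (not stated in the abstract): each token is credited to its
    single nearest lexicon word, provided the distance is at most
    max_distance; ties go to the word that appears first in the lexicon.
    """
    counts = Counter()
    for token in tokens:
        best_word, best_dist = None, max_distance + 1
        for word in lexicon:
            d = levenshtein(token, word)
            if d < best_dist:
                best_word, best_dist = word, d
        if best_word is not None:
            counts[best_word] += 1
    return [counts[word] for word in lexicon]


# Hypothetical toy lexicon and misspelled review, not data from the paper.
lexicon = ["boring", "excellent", "terrible"]
review = "an excelent film but the ending was teribble".split()
features = fuzzy_bow(review, lexicon, max_distance=2)
print(features)  # counts for "boring", "excellent", "terrible"
```

With a threshold of 2, the misspellings "excelent" and "teribble" are still counted toward "excellent" and "terrible". Raising the threshold to 3 in this toy example also maps "ending" onto "boring", which suggests why the threshold has to be tuned empirically. Feature vectors of this kind could then be passed to a Multinomial Naive Bayes or SVM classifier and scored with 10-fold cross-validation.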


