Effective Korean sentiment classification method using word2vec and ensemble classifier

  • Received : 2017.10.20
  • Accepted : 2018.01.29
  • Published : 2018.01.31

Abstract

Accurate sentiment classification is an important research topic in sentiment analysis. This study proposes an effective method for classifying the sentiment of Korean reviews using word2vec and ensemble methods, both of which have been studied extensively in recent years. For 200,000 Korean movie review texts, we generated part-of-speech (POS)-based bag-of-words (BOW) features, word2vec features, and integrated features that combine the two representations. For sentiment classification, we used four single classifiers (Logistic Regression, Decision Tree, Naive Bayes, and Support Vector Machine) and four ensemble classifiers (Adaptive Boosting, Bagging, Gradient Boosting, and Random Forest). The integrated feature representation, composed of BOW features including adjectives and adverbs together with word2vec features, showed the highest sentiment classification accuracy. Empirically, the SVM, a single classifier, achieved the highest performance, while the ensemble classifiers performed similarly to or slightly below the single classifiers.
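The integrated feature representation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the two toy reviews, their POS tags, the 3-dimensional vectors standing in for trained word2vec output, and all function names are assumptions made for illustration, and averaging word vectors per review is only one common way to turn word2vec output into a document feature. The sketch builds a BOW vector restricted to adjectives and adverbs, averages the word vectors of a review, and concatenates the two.

```python
from collections import Counter

# Hypothetical toy data: each review is a list of (token, POS tag) pairs.
# A Korean morphological analyzer would normally produce these; the tag
# names used here are illustrative.
reviews = [
    [("영화", "Noun"), ("정말", "Adverb"), ("재미있다", "Adjective")],
    [("연기", "Noun"), ("너무", "Adverb"), ("지루하다", "Adjective")],
]

# Hypothetical 3-dimensional vectors standing in for trained word2vec output.
embeddings = {
    "영화": [0.0, 0.3, 0.6],
    "정말": [0.3, 0.0, 0.3],
    "재미있다": [0.9, 0.3, 0.0],
    "연기": [0.0, 0.6, 0.3],
    "너무": [0.3, 0.3, 0.0],
    "지루하다": [-0.9, 0.0, 0.3],
}

KEPT_POS = {"Adjective", "Adverb"}  # POS filter for the BOW features
DIM = 3                             # dimensionality of the toy vectors


def build_vocab(tagged_reviews, kept_pos):
    """Map each token with a kept POS tag to a column index."""
    tokens = sorted(
        {tok for review in tagged_reviews for tok, pos in review if pos in kept_pos}
    )
    return {tok: idx for idx, tok in enumerate(tokens)}


def bow_vector(review, vocab, kept_pos):
    """Count occurrences of kept-POS tokens in one review."""
    vec = [0.0] * len(vocab)
    counts = Counter(tok for tok, pos in review if pos in kept_pos)
    for tok, count in counts.items():
        if tok in vocab:
            vec[vocab[tok]] = float(count)
    return vec


def mean_embedding(review, embeddings, dim):
    """Average the vectors of all tokens in the review that have a vector."""
    vectors = [embeddings[tok] for tok, _ in review if tok in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def integrated_features(review, vocab, embeddings, dim, kept_pos):
    """Concatenate the POS-filtered BOW vector with the averaged word vector."""
    return bow_vector(review, vocab, kept_pos) + mean_embedding(review, embeddings, dim)


vocab = build_vocab(reviews, KEPT_POS)
features = [
    integrated_features(review, vocab, embeddings, DIM, KEPT_POS) for review in reviews
]

for row in features:
    print(len(row), row)
```

Each resulting row could then be passed to any of the single or ensemble classifiers named in the abstract. Concatenation keeps the sparse lexical signal and the dense distributional signal side by side, which is the idea behind combining the two feature types.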
