Increasing Accuracy of Classifying Useful Reviews by Removing Neutral Terms

Improving the Classification Accuracy of Useful Reviews through Neutrality-Based Selective Term Removal

  • Lee, Minsik (Department of Business Administration, The Catholic University of Korea)
  • Lee, Hong Joo (Department of Business Administration, The Catholic University of Korea)
  • Received : 2016.08.17
  • Accepted : 2016.09.27
  • Published : 2016.09.30

Abstract

Customer product reviews have become an important factor in purchase decisions. Customers believe that reviews written by people who have already used a product offer more reliable information than what sellers provide. However, with so many products and reviews, rising search costs can outweigh the advantages of e-commerce: reading every review to find the pros and cons of a product is exhausting. To help potential customers find the most useful information without much difficulty, online stores provide various ways for customers to write and rate product reviews, and different methods have been developed to classify and recommend useful reviews, primarily using customer feedback on the helpfulness of reviews. Most shopping websites show customer reviews together with the product's average preference rating, the number of customers who voted on it, and the distribution of those ratings. Most information on the helpfulness of product reviews is collected through a voting system. Amazon.com asks customers whether a review of a product is helpful, and it places the most helpful favorable review and the most helpful critical review at the top of the review list. Some companies also predict the usefulness of a review from attributes such as its length, its author, and the words used, and publish only reviews that are likely to be useful. Text mining approaches have been used to classify useful reviews in advance. Applying a text mining approach to all reviews of a product requires a term-document matrix: every word is extracted from the reviews, and each cell of the matrix holds the number of times a term occurs in a review.
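As a minimal sketch of the term-document matrix described above, the following Python builds a count matrix with one row per review and one column per term. The tokenizer (lowercasing and keeping alphabetic runs) and the function names are illustrative assumptions; the abstract does not say how the reviews were tokenized.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a review and split it into alphabetic word tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def build_term_document_matrix(reviews):
    """Return (terms, matrix), where matrix[i][j] is the count of terms[j] in reviews[i]."""
    counts = [Counter(tokenize(review)) for review in reviews]
    terms = sorted(set().union(*counts))  # vocabulary over all reviews
    matrix = [[count[term] for term in terms] for count in counts]
    return terms, matrix


reviews = ["Great phone, great battery.", "Battery died after a week."]
terms, matrix = build_term_document_matrix(reviews)
```

With thousands of reviews the vocabulary, and therefore the number of columns, grows quickly, which is why some terms usually have to be removed before classification.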
Because there are many reviews, the term-document matrix becomes very large, which makes it difficult to apply text mining algorithms. Researchers therefore delete sparse terms, since sparse words have little effect on classification or prediction. The purpose of this study is to suggest a better way of building the term-document matrix by deleting terms that are useless for review classification. We propose a neutrality index for selecting the words to delete. Even after sparse words are removed, many words still appear in both classes, useful and not useful, and these words have little or even a negative effect on classification performance. We define such words as neutral terms and delete those that appear with similar frequency in both classes: after deleting sparse words, we select further words to delete according to their neutrality. We tested our approach on Amazon.com review data from five product categories: Cellphones & Accessories; Movies & TV program; Automotive; CDs & Vinyl; and Clothing, Shoes & Jewelry. We used reviews that received more than four votes from users and classified each review as useful or not useful using a threshold of 60% for the ratio of useful votes to total votes. We randomly selected 1,500 useful reviews and 1,500 not-useful reviews for each product category. We then applied Information Gain and Support Vector Machine (SVM) algorithms to classify the reviews and compared classification performance in terms of precision, recall, and F-measure. Although performance varied across product categories and data sets, deleting terms by both sparsity and neutrality gave the best F-measure for both classification algorithms. However, deleting terms by sparsity alone gave the best recall for Information Gain, and using all terms gave the best precision for SVM.
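The two-stage term deletion described above (sparsity first, then neutrality) might be sketched as follows. The abstract does not give the formula for the neutrality index or the thresholds used, so everything specific here is an assumption for illustration: sparsity is taken as the share of reviews that do not contain a term, `neutrality` is a hypothetical index equal to one minus the normalized difference between the shares of useful and not-useful reviews containing the term, and `max_sparsity` and `max_neutrality` are made-up parameters.

```python
def document_frequency(docs):
    """Map each term to the number of documents (token lists) that contain it."""
    df = {}
    for tokens in docs:
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1
    return df


def neutrality(p_useful, p_not_useful):
    """Hypothetical index in [0, 1]; 1 means the term is equally common in both classes."""
    total = p_useful + p_not_useful
    if total == 0:
        return 0.0
    return 1.0 - abs(p_useful - p_not_useful) / total


def select_terms(useful_docs, not_useful_docs, max_sparsity=0.99, max_neutrality=0.9):
    """Drop sparse terms first, then drop terms that are too neutral.

    Both document lists are assumed to be non-empty.
    """
    n_useful, n_not = len(useful_docs), len(not_useful_docs)
    df_useful = document_frequency(useful_docs)
    df_not = document_frequency(not_useful_docs)
    kept = []
    for term in sorted(set(df_useful) | set(df_not)):
        u, n = df_useful.get(term, 0), df_not.get(term, 0)
        sparsity = 1.0 - (u + n) / (n_useful + n_not)
        if sparsity > max_sparsity:
            continue  # stage 1: the term is too rare
        if neutrality(u / n_useful, n / n_not) > max_neutrality:
            continue  # stage 2: the term is too evenly spread over both classes
        kept.append(term)
    return kept


useful = [["battery", "lasts", "phone"], ["battery", "phone", "screen"]]
not_useful = [["phone", "ok"], ["phone", "bad"]]
kept_terms = select_terms(useful, not_useful, max_sparsity=0.7)
```

In this toy data `phone` appears in every review of both classes, so it is removed as neutral, while terms that occur in only one review are removed as sparse. In practice the neutrality cutoff would be tuned per data set, which matches the abstract's observation that results vary by product category.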
Term-deletion methods and classification algorithms should therefore be selected carefully according to the data set.
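Because the best method depends on which metric matters, it helps to compute all three measures side by side. The sketch below computes precision, recall, and F-measure for the "useful" class from true and predicted labels. It assumes the F-measure is the balanced F1 score (the harmonic mean of precision and recall), which the abstract does not state explicitly; the label strings are also illustrative.

```python
def precision_recall_f1(y_true, y_pred, positive="useful"):
    """Return (precision, recall, F1) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


y_true = ["useful", "useful", "useful", "not_useful", "not_useful"]
y_pred = ["useful", "useful", "not_useful", "useful", "not_useful"]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

A method that maximizes recall (few useful reviews missed) can still lose on precision (more not-useful reviews shown), so the choice of term-deletion method follows from which error is more costly for the application.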

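In e-commerce, reviews from customers who have already bought and used a product strongly influence consumers' purchase decisions. E-commerce companies encourage customers to leave product reviews, and buyers actively share their experiences. Because a single product can now attract so many reviews, reading all of them to work out its pros and cons has become very hard. E-commerce companies and researchers use text mining to identify the attributes of useful reviews or to classify reviews as useful or not useful in advance, so that useful reviews can be filtered and delivered to customers. This study proposes neutrality as a criterion for removing terms from the document-term matrix: for the problem of separating useful from not-useful online customer reviews, neutrality measures the degree to which a term appears in both the useful and the not-useful review sets. We used neutrality together with sparsity to select terms for removal and then compared the performance of each classification algorithm. We searched for the neutrality level that gave the best performance and selectively removed terms according to sparsity and neutrality. The experiments used Amazon.com customer reviews, and users' ratings of those reviews, from the product categories 'Cellphones & Accessories', 'Movies & TV program', 'Automotive', 'CDs & Vinyl', and 'Clothing, Shoes & Jewelry'. From reviews with four or more total votes, we randomly sampled 1,500 reviews judged useful and 1,500 reviews judged not useful for each product category. The degree of improvement varied across data sets; by F-measure, removing terms based on both sparsity and neutrality performed better for both algorithms. However, with Information Gain, removing terms by sparsity alone always gave higher recall across the five product categories, and with SVM, using all terms gave higher precision. The term-removal method should therefore be chosen with the algorithm and the purpose of the analysis in mind.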
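The labeling and sampling step described above can be sketched as follows. The 60% useful-vote ratio is taken from the English abstract. The vote cutoff is left as a parameter because the English abstract says "more than four" votes (`min_votes=5`), while the Korean abstract says four or more (`min_votes=4`). Whether a review at exactly 60% counts as useful is not stated, so the `>=` comparison is an assumption, as are the field names `helpful` and `total`.

```python
import random


def label_review(helpful_votes, total_votes, min_votes=5, ratio_threshold=0.6):
    """Return 'useful', 'not_useful', or None when the review has too few votes."""
    if total_votes < min_votes:
        return None
    ratio = helpful_votes / total_votes
    return "useful" if ratio >= ratio_threshold else "not_useful"


def balanced_sample(reviews, per_class=1500, seed=0):
    """Randomly pick up to per_class reviews for each label.

    Each review is a dict with hypothetical keys 'helpful' and 'total'.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_label = {"useful": [], "not_useful": []}
    for review in reviews:
        label = label_review(review["helpful"], review["total"])
        if label is not None:
            by_label[label].append(review)
    return {
        label: rng.sample(items, min(per_class, len(items)))
        for label, items in by_label.items()
    }


toy_reviews = [
    {"helpful": 9, "total": 10},
    {"helpful": 1, "total": 8},
    {"helpful": 2, "total": 3},  # dropped: too few votes
]
sample = balanced_sample(toy_reviews, per_class=1)
```

Sampling the same number of reviews from each class keeps the two classes balanced, so a classifier cannot score well simply by always predicting the majority label.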
