The Detection of Online Manipulated Reviews Using Machine Learning and GPT-3

  • Chernyaeva, Olga (College of Business Administration, Pusan National University) ;
  • Hong, Taeho (College of Business Administration, Pusan National University)
  • Received : 2022.11.20
  • Accepted : 2022.12.15
  • Published : 2022.12.31

Abstract

Fraudulent companies and sellers strategically manipulate reviews to influence customers' purchase decisions, so the reliability of reviews has become crucial for customer decision-making. Because customers increasingly rely on online reviews for detailed information about products or services before purchasing, many researchers focus on detecting manipulated reviews. The main obstacle is the difficulty of obtaining enough manipulated reviews to train machine learning models. In addition, manipulated reviews are scarce relative to non-manipulated reviews, which creates a class imbalance problem. The under-represented class can hamper a model's accuracy, so resolving the imbalance is important for building an accurate detector of manipulated reviews. We therefore propose an OpenAI-based review generation model that addresses the imbalance and thereby improves the accuracy of manipulated review detection. In this research, we applied GPT-3, an autoregressive language model, to generate new reviews based on existing manipulated reviews. We found that oversampling manipulated reviews with GPT-3 recovers a satisfactory portion of the performance loss and yields better classification performance (logit, decision tree, neural networks) than traditional oversampling methods such as random oversampling and SMOTE.

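Because some companies and online sellers seek to profit by improperly manipulating the online reviews that influence customers' purchase decisions, the reliability of reviews has become a critical issue in online commerce. As consumers' dependence on online reviews in online shopping malls and similar venues has grown, many studies have sought to develop methods for detecting manipulated reviews. Prior studies proposed models that detect manipulated reviews by applying machine learning to genuine and manipulated online reviews. Machine learning has shown excellent performance on binary classification problems, but such performance can be expected only when sufficient training data is available. Manipulated reviews available for training are scarce, which leaves machine learning with the critical weakness that it cannot learn adequately. As a way to prevent the degradation in learning caused by imbalanced datasets, this study proposes generating the scarce manipulated reviews with artificial intelligence and then training machine learning models on the resulting balanced dataset to detect manipulated reviews. Fine-tuned GPT-3, a hyperscale AI model, was used as an oversampling approach that resolves the data imbalance problem by generating reviews for online platforms. Reviews generated by GPT-3 are AI-written reviews based on existing reviews; an empirical analysis comparing them with SMOTE and simple oversampling confirmed that they improve the performance of the logit, decision tree, and artificial neural network models used in this study.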
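The random oversampling baseline named in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function name `random_oversample`, the toy reviews, and the label coding (1 = manipulated, 0 = genuine) are all assumptions made for illustration. The sketch duplicates randomly chosen minority-class reviews until both classes are the same size. The GPT-3 approach described above would instead fill the shortfall with newly generated text.

```python
import random
from collections import Counter


def random_oversample(texts, labels, minority_label, seed=0):
    """Duplicate randomly chosen minority examples until classes are balanced.

    Hypothetical helper for illustration; assumes a binary labelling.
    """
    rng = random.Random(seed)
    minority = [t for t, y in zip(texts, labels) if y == minority_label]
    if not minority:
        raise ValueError("no minority examples to sample from")
    majority_count = sum(1 for y in labels if y != minority_label)
    # Number of extra minority examples needed to match the majority class.
    shortfall = max(0, majority_count - len(minority))
    extra = [rng.choice(minority) for _ in range(shortfall)]
    return list(texts) + extra, list(labels) + [minority_label] * shortfall


# Toy, invented data: 1 = manipulated, 0 = genuine (assumed coding).
reviews = [
    "Best product ever, buy now!!!",
    "Arrived on time, works as described.",
    "Battery lasts about a day.",
    "Packaging was damaged but the item is fine.",
    "Decent value for the price.",
]
labels = [1, 0, 0, 0, 0]

balanced_reviews, balanced_labels = random_oversample(
    reviews, labels, minority_label=1
)
print(Counter(balanced_labels))
```

Because this baseline only repeats existing minority reviews, it adds no new wording for a classifier to learn from. In practice, oversampling of any kind is usually applied to the training split only, so that duplicated or generated reviews do not leak into the evaluation data.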

Acknowledgement

This work was supported by a 2-Year Research Grant of Pusan National University.
