A Method for Twitter Spam Detection Using N-Gram Dictionary Under Limited Labeling

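  • 최혁준 (Department of Computer Engineering, Chungnam National University)
  • 박정희 (Department of Computer Engineering, Chungnam National University)
  • Received : 2017.07.11
  • Reviewed : 2017.08.12
  • Published : 2017.09.30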

Abstract

This paper proposes a method for detecting spam tweets that contain inappropriate content by using n-gram dictionaries when labeled training data is limited. Such spam tweets tend to reuse similar words and sentences. Exploiting this characteristic, we build n-gram dictionaries from spam tweets and from normal tweets and show that applying a naive Bayes classifier to them detects spam tweets effectively. However, because Twitter receives a large volume of data in real time, constructing an initial training set is very costly. A spam detection method is therefore needed that works when the initial training set is very small or does not exist. To this end, we propose generating pseudo-labels through Twitter's retweet function and using them both to construct the initial training set and to update the n-gram dictionaries. Experiments on 1.3 million Korean tweets collected from December 1 to December 7, 2016 show that the proposed method outperforms the spam detection methods it is compared against.
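The core idea of the abstract, two n-gram dictionaries scored by a naive Bayes classifier, can be illustrated with a minimal sketch. This is not the authors' implementation: the use of character trigrams, Laplace smoothing with `alpha=1.0`, and the names `char_ngrams` and `NGramNaiveBayes` are all assumptions made for illustration, since the abstract does not specify the n-gram unit, the value of n, or the smoothing scheme. The `update` method accepts any labeled tweet, so it could in principle also be fed pseudo-labeled tweets, but how pseudo-labels are derived from retweets is not sketched here.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the character n-grams of a tweet (n=3 is an assumed setting)."""
    text = " ".join(text.split())  # collapse runs of whitespace
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NGramNaiveBayes:
    """Multinomial naive Bayes over two n-gram dictionaries (hypothetical sketch)."""

    LABELS = ("spam", "normal")

    def __init__(self, n=3, alpha=1.0):
        self.n = n
        self.alpha = alpha  # Laplace smoothing constant (assumed)
        self.dicts = {label: Counter() for label in self.LABELS}
        self.doc_counts = {label: 0 for label in self.LABELS}

    def update(self, text, label):
        """Add one labeled tweet to the n-gram dictionary of its class."""
        self.dicts[label].update(char_ngrams(text, self.n))
        self.doc_counts[label] += 1

    def log_score(self, text, label):
        """Smoothed log prior plus smoothed log likelihood of the tweet's n-grams."""
        vocab = max(1, len(set(self.dicts["spam"]) | set(self.dicts["normal"])))
        total = sum(self.dicts[label].values())
        n_docs = sum(self.doc_counts.values())
        score = math.log(
            (self.doc_counts[label] + self.alpha)
            / (n_docs + self.alpha * len(self.LABELS))
        )
        for gram in char_ngrams(text, self.n):
            count = self.dicts[label][gram]  # Counter returns 0 for unseen n-grams
            score += math.log((count + self.alpha) / (total + self.alpha * vocab))
        return score

    def predict(self, text):
        """Return the label whose dictionary explains the tweet best."""
        return max(self.LABELS, key=lambda label: self.log_score(text, label))


# Toy usage with made-up tweets (not data from the paper).
clf = NGramNaiveBayes()
clf.update("free followers click here now", "spam")
clf.update("get free followers click this link", "spam")
clf.update("had a great lunch with friends today", "normal")
clf.update("watching the game with my family tonight", "normal")
print(clf.predict("click here for free followers"))
```

Character n-grams are used in this sketch because they need no tokenizer or morphological analyzer, which is convenient for Korean text; word-level n-grams would fit the same structure by swapping out `char_ngrams`.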
