A New Fine-grain SMS Corpus and Its Corresponding Classifier Using Probabilistic Topic Model

  • Ma, Jialin (College of Computer and Information, Hohai University)
  • Zhang, Yongjun (College of Computer and Information, Hohai University)
  • Wang, Zhijian (College of Computer and Information, Hohai University)
  • Chen, Bolun (Huaiyin Institute of Technology)
  • Received : 2017.03.26
  • Accepted : 2017.10.24
  • Published : 2018.02.28

Abstract

SMS spam is now widespread in many countries, and the standards for filtering it differ from country to country. However, current technologies and research on SMS spam filtering all focus on dividing SMS messages into two classes: legitimate and illegitimate. This binary division does not match actual conditions and needs. Furthermore, existing approaches face several difficulties: (1) High-quality, large-scale SMS spam corpora are very scarce, and no fine-grained categorized SMS spam corpus exists at all, which seriously handicaps research. (2) The limited length of SMS messages leads to a lack of sufficient features. These factors seriously degrade the performance of traditional classifiers (such as SVM, K-NN, and Bayes). In this paper, we present a new fine-grained categorized SMS spam corpus, which is, as far as we know, unique and the largest of its kind. In addition, we propose a classifier based on a probabilistic topic model. The classifier can alleviate the feature sparsity problem in SMS spam filtering. Moreover, we compare the approach with three typical classifiers on the new SMS spam corpus. The experimental results show that the proposed approach is more effective for SMS spam filtering.
