Analyzing the Effect of Lexical and Conceptual Information in Spam-mail Filtering System

  • Kang Sin-Jae (School of Computer and Information Technology, Daegu University)
  • Kim Jong-Wan (School of Computer and Information Technology, Daegu University)
  • Published : 2006.06.01

Abstract

In this paper, we construct a two-phase spam-mail filtering system based on lexical and conceptual information. Two kinds of information can distinguish spam from ham (non-spam) mail. The definite information comprises the mail sender's information, URLs, and a list of known spam keywords; the less definite information comprises the word list and concept codes extracted from the mail body. The first phase classifies spam using the definite information, and the second phase applies SVM learning to the lexical information and concept codes contained in the mail body. In our results, the ham misclassification rate fell as more lexical information was used as features, and the spam misclassification rate fell when concept codes were also included as features.
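The two-phase flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the blocklists, the word-to-concept-code lexicon, every function name, and the `toy_phase2` rule are hypothetical stand-ins, and the second phase takes any callable in place of the trained SVM. The error-rate helper assumes the usual definitions (ham misclassification rate is the fraction of ham labeled spam, spam misclassification rate is the fraction of spam labeled ham), which the abstract does not spell out.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

# Hypothetical phase-1 resources; the paper's actual lists are not given here.
BLOCKED_SENDERS: Set[str] = {"promo@spam.example"}
BLOCKED_URLS: Set[str] = {"http://spam.example/buy"}
SPAM_KEYWORDS: Set[str] = {"viagra", "lottery"}

# Hypothetical word -> concept code lexicon (stand-in for a thesaurus lookup).
CONCEPT_CODES: Dict[str, str] = {
    "loan": "C_MONEY",
    "cash": "C_MONEY",
    "meeting": "C_WORK",
}

PUNCTUATION = ".,!?:;()\"'"


def tokenize(text: str) -> List[str]:
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    stripped = (tok.strip(PUNCTUATION) for tok in text.lower().split())
    return [tok for tok in stripped if tok]


def phase1_is_spam(sender: str, body: str) -> bool:
    """Phase 1: decide from definite information only (sender, URL, keywords)."""
    if sender.lower() in BLOCKED_SENDERS:
        return True
    if any(url in body for url in BLOCKED_URLS):
        return True
    return any(tok in SPAM_KEYWORDS for tok in tokenize(body))


def extract_features(body: str, use_concepts: bool = True) -> Dict[str, int]:
    """Phase 2 features: word counts, optionally plus concept-code counts."""
    features: Dict[str, int] = {}
    for tok in tokenize(body):
        features["w:" + tok] = features.get("w:" + tok, 0) + 1
        code = CONCEPT_CODES.get(tok)
        if use_concepts and code is not None:
            features["c:" + code] = features.get("c:" + code, 0) + 1
    return features


def classify(sender: str, body: str,
             phase2: Callable[[Dict[str, int]], bool]) -> str:
    """Run phase 1 first; only undecided mail reaches the phase-2 classifier."""
    if phase1_is_spam(sender, body):
        return "spam"
    return "spam" if phase2(extract_features(body)) else "ham"


def error_rates(pairs: Iterable[Tuple[str, str]]) -> Tuple[float, float]:
    """Return (ham misclassification rate, spam misclassification rate)
    from (gold label, predicted label) pairs."""
    pairs = list(pairs)
    ham_preds = [pred for gold, pred in pairs if gold == "ham"]
    spam_preds = [pred for gold, pred in pairs if gold == "spam"]
    ham_rate = (sum(p == "spam" for p in ham_preds) / len(ham_preds)
                if ham_preds else 0.0)
    spam_rate = (sum(p == "ham" for p in spam_preds) / len(spam_preds)
                 if spam_preds else 0.0)
    return ham_rate, spam_rate


def toy_phase2(features: Dict[str, int]) -> bool:
    """Placeholder decision rule, NOT a trained SVM: flags money-heavy mail."""
    return features.get("c:C_MONEY", 0) >= 2
```

In practice `toy_phase2` would be replaced by a classifier trained on the feature dictionaries, and calling `extract_features` with `use_concepts=False` gives the lexical-only feature set for comparison against the combined one.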
