Context-based classification for harmful web documents and comparison of feature selecting algorithms

  • Kim, Young-Soo (Cryptography Research Team, Electronics and Telecommunications Research Institute)
  • Park, Nam-Je (RFID/USN Security Research Team, Electronics and Telecommunications Research Institute)
  • Hong, Do-Won (Cryptography Research Team, Electronics and Telecommunications Research Institute)
  • Won, Dong-Ho (School of Information and Communication Engineering, Sungkyunkwan University)
  • Published : 2009.06.30

Abstract

More and richer information sources and services become available on the web every day. However, harmful information, such as adult content, is not appropriate for all users, notably children. Because the Internet is an open, worldwide network, there is a limit to how far any one country's laws or systems can regulate users who provide harmful content. Nor is it desirable to develop classification technology tied to one specific system, because Internet users can encounter harmful content in diverse ways, for example through porn sites, harmful spam, or peer-to-peer networks. Research and development of context-based core technologies for classifying harmful content is therefore being emphasized. In this paper, we propose an efficient text filter that blocks harmful text in web documents using context-based technologies. Through implementation and experiment, we also examine which feature selection algorithms are suitable for classifying harmful content, where feature selection is the process of choosing, from all the content terms that occur in the documents, those terms that are useful as features for text categorization.
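
As an illustration of what feature selection means here, the following is a minimal sketch that ranks terms by the chi-square statistic and keeps the top k. The abstract does not say which algorithms were compared, so the choice of chi-square is an assumption, as are the function names `chi_square_scores` and `select_features` and the pre-tokenized input format; this is not the authors' implementation.

```python
from collections import Counter


def chi_square_scores(docs, labels):
    """Score each term by its chi-square statistic for a two-class problem.

    docs   : list of token lists (hypothetical, already tokenized documents)
    labels : list of bools, True for the "harmful" class (assumed labeling)
    """
    n = len(docs)
    n_pos = sum(labels)
    n_neg = n - n_pos

    # Count in how many documents of each class a term appears.
    df_pos, df_neg = Counter(), Counter()
    for tokens, is_pos in zip(docs, labels):
        target = df_pos if is_pos else df_neg
        target.update(set(tokens))  # set(): document frequency, not term frequency

    scores = {}
    for term in set(df_pos) | set(df_neg):
        a = df_pos[term]   # harmful documents containing the term
        b = df_neg[term]   # benign documents containing the term
        c = n_pos - a      # harmful documents without the term
        d = n_neg - b      # benign documents without the term
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        scores[term] = n * (a * d - c * b) ** 2 / denom if denom else 0.0
    return scores


def select_features(docs, labels, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    scores = chi_square_scores(docs, labels)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

Note that chi-square rewards terms that are strongly associated with either class, so a term that occurs only in benign documents scores as highly as one that occurs only in harmful documents, while a term spread evenly across both classes scores zero.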
