A Hierarchical Text Rating System for Objectionable Documents

  • Published : 2005.12.01

Abstract

In this paper, we classify objectionable texts into four ratings according to their harmfulness and propose a hierarchical text rating system for objectionable documents. Because documents in the same category share similar vocabulary, expressions, and document structure, a text rating system that relies on a single classification model has low accuracy. To solve this problem, we separate objectionable documents into several subsets according to their properties and then classify the subsets hierarchically. The proposed system consists of three layers. In each layer, we select features using the chi-square statistic, weight the selected features with the TF-IDF scheme, and feed the weights to a non-linear SVM classifier. Because each layer uses its own features and its own number of features, the hierarchical scheme can characterize the objectionability of documents more effectively and is expected to improve the performance of the rating system. We compared the proposed system with several other text rating systems, and the experimental results show that the proposed system can achieve excellent classification performance.
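
The per-layer feature pipeline described in the abstract (chi-square feature selection followed by TF-IDF weighting) can be sketched as below. This is a minimal illustration, not the authors' implementation: the abstract does not give the exact formulas, so the sketch assumes the standard two-by-two chi-square statistic and a plain tf × log(N/df) weight. The function names and the toy corpus are hypothetical, and the non-linear SVM that would consume the resulting vectors is omitted.

```python
import math
from collections import Counter


def chi_square(docs, labels, term, positive):
    """Chi-square statistic of `term` against class `positive` (2x2 table)."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_class = label == positive
        if has_term and in_class:
            a += 1  # term present, document in class
        elif has_term:
            b += 1  # term present, document outside class
        elif in_class:
            c += 1  # term absent, document in class
        else:
            d += 1  # term absent, document outside class
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def select_features(docs, labels, positive, k):
    """Return the k terms with the highest chi-square score (ties: alphabetical)."""
    vocab = {t for tokens in docs for t in tokens}
    ranked = sorted(vocab, key=lambda t: (-chi_square(docs, labels, t, positive), t))
    return ranked[:k]


def tfidf_vector(tokens, features, docs):
    """Weight each selected feature by term frequency times log(N / df)."""
    n = len(docs)
    counts = Counter(tokens)
    vector = []
    for term in features:
        df = sum(1 for doc in docs if term in doc)
        idf = math.log(n / df) if df else 0.0
        vector.append(counts[term] * idf)
    return vector


# Hypothetical toy corpus: tokenized documents with binary labels.
docs = [
    ["explicit", "adult", "content"],
    ["adult", "explicit", "images"],
    ["weather", "news", "content"],
    ["sports", "news", "today"],
]
labels = ["harmful", "harmful", "safe", "safe"]

features = select_features(docs, labels, "harmful", k=2)
vectors = [tfidf_vector(doc, features, docs) for doc in docs]
```

In a hierarchical setup, each layer would call `select_features` with its own training subset and its own `k`, so the feature set and its size can differ from layer to layer before the vectors are passed to that layer's classifier.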
