Neural Text Categorizer for Exclusive Text Categorization

  • Jo, Tae-Ho (School of Computer and Information Engineering, Inha University)
  • Published: 2008.06.30

Abstract

This research proposes a new neural network for text categorization that uses a representation of documents other than numerical vectors. Because the proposed neural network is intended solely for text categorization, it is called NTC (Neural Text Categorizer). Numerical vectors that represent documents in text mining tasks have two inherent problems: huge dimensionality and sparse distribution. Although many feature selection methods have been developed to address the first problem, the reduced dimension still remains large, and if a feature selection method reduces the dimension excessively, the robustness of text categorization degrades. Even though SVM (Support Vector Machine) tolerates huge dimensionality, it does not tolerate sparse distribution. The goal of this research is to address both problems at the same time by proposing a new representation of documents and a new neural network that uses that representation as its input.
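
The two problems named in the abstract can be illustrated with a minimal sketch. It assumes a plain bag-of-words encoding with whitespace tokenization, which is only one common way of building numerical vectors and is not taken from the paper. The toy corpus and the helper names `bag_of_words_vectors` and `sparsity` are hypothetical, and the sketch does not implement NTC or its proposed representation.

```python
from collections import Counter


def bag_of_words_vectors(documents):
    """Encode each document as a term-count vector over the corpus vocabulary.

    Assumption: lowercase whitespace tokenization, chosen only for illustration.
    """
    tokenized = [doc.lower().split() for doc in documents]
    # The vector dimension equals the number of distinct terms in the corpus.
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts.get(term, 0) for term in vocabulary])
    return vocabulary, vectors


def sparsity(vectors):
    """Return the fraction of zero entries across all vectors."""
    total = sum(len(vector) for vector in vectors)
    zeros = sum(1 for vector in vectors for value in vector if value == 0)
    return zeros / total if total else 0.0


# Hypothetical toy corpus: three short documents on unrelated topics.
toy_corpus = [
    "stock markets fell sharply today",
    "the team won the championship game",
    "new vaccine trial shows promise",
]

vocabulary, vectors = bag_of_words_vectors(toy_corpus)
dimension = len(vocabulary)
zero_fraction = sparsity(vectors)
print(dimension, round(zero_fraction, 2))
```

Even in this tiny corpus, the dimension grows with every new distinct term, and most entries of each vector are zero because each document uses only a small part of the vocabulary. On a realistic corpus both effects are far more pronounced, which is the situation the abstract describes.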

Cited by

  1. Effective language identification of forum texts based on statistical approaches vol.52, pp.4, 2016, https://doi.org/10.1016/j.ipm.2015.12.003