Neural Text Categorizer for Exclusive Text Categorization

  • Jo, Tae-Ho
  • Published : 2008.06.30

Abstract

This research proposes a new neural network for text categorization that represents documents in a form other than numerical vectors. Because the proposed neural network is intended solely for text categorization, it is called NTC (Neural Text Categorizer) in this research. Numerical vectors that represent documents in text mining tasks have two inherent problems: huge dimensionality and sparse distribution. Although many feature selection methods have been developed to address the first problem, the reduced dimension still remains large, and if a feature selection method reduces the dimension excessively, the robustness of text categorization degrades. Even though SVM (Support Vector Machine) tolerates huge dimensionality, it does not tolerate sparse distribution. The goal of this research is to address both problems at the same time by proposing a new representation of documents and a new neural network that uses this representation as its input vector.
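
The two problems named in the abstract can be seen even on a tiny corpus. The following is a minimal sketch, not the NTC method itself: it assumes a plain whitespace tokenizer and term-frequency encoding, and the names `bag_of_words_vectors`, `sparsity`, and `toy_corpus` are hypothetical, introduced only for illustration. It shows that the vector dimension equals the vocabulary size and that most entries are zero.

```python
from collections import Counter


def bag_of_words_vectors(documents):
    """Encode each document as a term-frequency vector over the corpus vocabulary.

    Assumption: lowercase whitespace tokenization, chosen only for illustration.
    """
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One entry per vocabulary term; terms absent from the document count as 0.
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


def sparsity(vectors):
    """Return the fraction of zero entries across all vectors."""
    total = sum(len(vector) for vector in vectors)
    zeros = sum(1 for vector in vectors for value in vector if value == 0)
    return zeros / total if total else 0.0


# Hypothetical three-document corpus with no shared words.
toy_corpus = [
    "stock markets rally on earnings",
    "team wins championship final",
    "new vaccine trial shows promise",
]

vocabulary, vectors = bag_of_words_vectors(toy_corpus)
dimension = len(vocabulary)
zero_fraction = sparsity(vectors)
print(f"dimension={dimension}, zero fraction={zero_fraction:.3f}")
```

With real corpora the vocabulary typically grows to many thousands of terms while each document uses only a small share of them, so both effects become far more pronounced than in this toy case.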

Keywords

Disk Neural Text Categorizer;Text Categorization;NewsPage.com

Cited by

  1. Effective language identification of forum texts based on statistical approaches vol.52, pp.4, 2016, https://doi.org/10.1016/j.ipm.2015.12.003