# Neural Text Categorizer for Exclusive Text Categorization

• Jo, Tae-Ho (School of Computer and Information Engineering, Inha University)
• Published: 2008.06.30

#### Abstract

This research proposes a new neural network for text categorization that uses a representation of documents other than numerical vectors. Because the proposed neural network is designed specifically for text categorization, it is called NTC (Neural Text Categorizer) in this research. Numerical vectors representing documents for text mining tasks have two inherent problems: huge dimensionality and sparse distribution. Although many feature selection methods have been developed to address the first problem, the reduced dimension still remains large. If a feature selection method reduces the dimension excessively, the robustness of text categorization degrades. Although SVM (Support Vector Machine) tolerates huge dimensionality, it does not tolerate sparse distribution. The goal of this research is to address both problems at the same time by proposing a new representation of documents and a new neural network that takes this representation as its input.
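The two problems named in the abstract can be seen even on a toy corpus. The sketch below is only an illustration under stated assumptions: it uses a plain term-frequency (bag-of-words) encoding, a made-up three-document corpus, and hypothetical helper names (`build_vocabulary`, `to_vector`, `sparsity`). It does not implement NTC or the representation proposed in the paper.

```python
from collections import Counter

# Hypothetical toy corpus for illustration; not data from the paper.
docs = [
    "stock markets fall as oil prices rise",
    "new vaccine trial shows promising results",
    "team wins championship after overtime goal",
]


def build_vocabulary(documents):
    """Return a sorted list of every distinct word in the corpus."""
    return sorted({word for doc in documents for word in doc.split()})


def to_vector(document, vocabulary):
    """Encode one document as a term-frequency vector over the vocabulary."""
    counts = Counter(document.split())
    return [counts[word] for word in vocabulary]


def sparsity(vector):
    """Fraction of entries in the vector that are zero."""
    return sum(1 for x in vector if x == 0) / len(vector)


vocabulary = build_vocabulary(docs)
vectors = [to_vector(doc, vocabulary) for doc in docs]

# The dimension equals the vocabulary size, so it grows with the corpus,
# while each document fills only the few entries for its own words.
print("dimension:", len(vocabulary))
for vector in vectors:
    nonzero = sum(1 for x in vector if x != 0)
    print(f"nonzero entries: {nonzero}, sparsity: {sparsity(vector):.2f}")
```

In a realistic corpus the vocabulary reaches tens of thousands of words while a single document still contains only a small number of them, which is the combination of huge dimensionality and sparse distribution that the abstract describes.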

#### Keywords

Disk Neural Text Categorizer;Text Categorization;NewsPage.com


#### Cited by

1. Effective language identification of forum texts based on statistical approaches vol.52, pp.4, 2016, https://doi.org/10.1016/j.ipm.2015.12.003