Category Factor Based Feature Selection for Document Classification

  • Kang Yun-Hee (Department of Computer and Communication Engineering, Cheonan University)
  • Published : 2005.10.01

Abstract

With the rapid growth of information on the Internet, it is becoming increasingly difficult to find and organize useful information. Automatic text classification is needed to reduce information overload and to handle very large document collections. A Support Vector Machine (SVM) is a model whose output is computed as a weighted sum of kernel function outputs. This paper describes a classifier for web documents in the field of Information Technology that uses an SVM to learn a model constructed from the training sets and their representative terms. The basic idea is to exploit, by simple statistical methods, the meaning distribution of the representative terms in the coherent thematic texts of each category. Documents in each category are represented in the vector space model, with features selected by a scheme based on TFiDF. A category factor, which represents the effect of a term within a category, is applied to the feature selection. Experiments report the categorization results and their correlation with vector length.
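
The abstract does not define the category factor, so the following is only a minimal sketch of how a per-category weight could be folded into TFiDF-based feature selection. Everything specific here is an assumption made for illustration: the function name select_features, the smoothed IDF, and in particular the category factor itself, which is taken to be the share of a term's documents that belong to the category. The paper's actual definition may differ.

```python
import math
from collections import Counter, defaultdict


def select_features(docs, labels, top_k):
    """Rank terms per category by TF * IDF * (assumed) category factor.

    docs   -- list of token lists, one per document
    labels -- category name for each document, parallel to docs
    top_k  -- number of terms to keep per category
    """
    n_docs = len(docs)

    # Corpus-wide document frequency of each term.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    # Per-category term frequency and per-category document frequency.
    tf_cat = defaultdict(Counter)
    df_cat = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        tf_cat[label].update(tokens)
        df_cat[label].update(set(tokens))

    selected = {}
    for cat, term_counts in tf_cat.items():
        scores = {}
        for term, tf in term_counts.items():
            # Smoothed IDF: a term found in every document still gets weight 1.
            idf = math.log(n_docs / df[term]) + 1.0
            # ASSUMPTION: category factor = fraction of the documents
            # containing the term that belong to this category.
            category_factor = df_cat[cat][term] / df[term]
            scores[term] = tf * idf * category_factor
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        selected[cat] = ranked[:top_k]
    return selected


# Toy corpus (hypothetical data, not from the paper).
docs = [
    ["svm", "kernel", "the"],
    ["svm", "margin", "the"],
    ["router", "packet", "the"],
    ["router", "switch", "the"],
]
labels = ["ml", "ml", "net", "net"]
selected = select_features(docs, labels, top_k=2)
```

Under this assumed factor, a term such as "the" that is spread evenly across categories is down-weighted, while terms concentrated in one category are kept as its representative terms. The selected terms would then define the dimensions of the document vectors passed to the SVM.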