Inverted Index based Modified Version of KNN for Text Categorization

Jo, Tae-Ho

doi:10.3745/JIPS.2008.4.1.017

Journal of Information Processing Systems

Volume 4, Issue 1 / Pages 17-26 / 2008 / 1976-913X (pISSN) / 2092-805X (eISSN)

Korea Information Processing Society

Jo, Tae-Ho (School of Computer and Information Engineering, Inha University)

Published: 2008.03.31

https://doi.org/10.3745/JIPS.2008.4.1.017

Abstract

This research proposes a new strategy in which documents are encoded into string vectors and a modified version of KNN is made adaptable to string vectors for text categorization. Traditionally, when KNN is used for pattern classification, raw data must first be encoded into numerical vectors. This encoding may be difficult, depending on the application area. For example, in text categorization, encoding full texts given as raw data into numerical vectors leads to two main problems: huge dimensionality and sparse distribution. In this research, we instead encode full texts into string vectors and modify the supervised learning algorithm so that it operates on string vectors for text categorization.
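
The abstract does not say how string vectors are built or compared, so the following Python sketch is only an illustration under stated assumptions, not the paper's actual procedure. It assumes that a string vector holds a document's most frequent words in rank order, that two words are similar to the degree their posting lists in an inverted index overlap (Jaccard overlap), and that two string vectors are compared position by position. All function names and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict


def encode(text, dim=3):
    """Encode a text as a string vector: its `dim` most frequent words.

    Assumption: ranking by frequency, with ties broken alphabetically.
    """
    counts = Counter(text.lower().split())
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return ranked[:dim]


def build_inverted_index(corpus):
    """Map each word to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(corpus):
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def word_similarity(a, b, index):
    """Assumed word similarity: Jaccard overlap of the two posting lists."""
    if a == b:
        return 1.0
    docs_a, docs_b = index.get(a, set()), index.get(b, set())
    union = docs_a | docs_b
    return len(docs_a & docs_b) / len(union) if union else 0.0


def vector_similarity(u, v, index):
    """Average the word similarity over aligned positions of two string vectors."""
    n = min(len(u), len(v))
    if n == 0:
        return 0.0
    return sum(word_similarity(a, b, index) for a, b in zip(u, v)) / n


def knn_classify(text, train, index, k=3, dim=3):
    """Label a text by majority vote among its k most similar training vectors."""
    query = encode(text, dim)
    scored = sorted(
        ((vector_similarity(query, vec, index), label) for vec, label in train),
        key=lambda pair: -pair[0],
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Toy labeled corpus (hypothetical data, for illustration only).
corpus = [
    "stock market stock price market rises",
    "bank stock market bank loan",
    "football match football team goal",
    "team wins match team goal",
]
labels = ["business", "business", "sports", "sports"]

index = build_inverted_index(corpus)
train = [(encode(text), label) for text, label in zip(corpus, labels)]

print(knn_classify("football team football goal", train, index))
```

In this sketch no numerical document vector is ever built: each document is reduced to a short list of words, and the inverted index supplies the word-to-word similarities, which is one way to sidestep the huge dimensionality and sparse distribution mentioned above.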
