Inverted Index based Modified Version of K-Means Algorithm for Text Clustering

Jo, Tae-Ho;

doi:10.3745/JIPS.2008.4.2.067

Journal of Information Processing Systems

Volume 4 Issue 2 / Pages 67-76 / 2008 / 1976-913X (pISSN) / 2092-805X (eISSN)

Korea Information Processing Society (한국정보처리학회)


Jo, Tae-Ho (School of Computer and Information Engineering Inha University)

Published : 2008.06.30

https://doi.org/10.3745/JIPS.2008.4.2.067

Abstract

This research proposes a new strategy for text clustering in which documents are encoded into string vectors and the k-means algorithm is modified to operate on string vectors. Traditionally, when the k-means algorithm is used for pattern classification, raw data must be encoded into numerical vectors. This encoding may be difficult, depending on the application area. For example, in text clustering, encoding full texts given as raw data into numerical vectors leads to two main problems: huge dimensionality and sparse distribution. In this research, we encode full texts into string vectors and modify the k-means algorithm so that it is adaptable to string vectors for text clustering.
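The abstract does not say how string vectors are built or compared, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes a string vector is the list of a document's most frequent words, that two words are compared by the Jaccard overlap of their posting sets in an inverted index, and that a cluster prototype is the position-wise majority word. All function names are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict

def to_string_vector(text, dim=3):
    """Assumed encoding: the `dim` most frequent words, ties broken alphabetically."""
    counts = Counter(text.lower().split())
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return tuple(ranked[:dim])

def build_inverted_index(docs):
    """Map each word to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in set(text.lower().split()):
            index[word].add(doc_id)
    return index

def word_similarity(a, b, index):
    """Assumed measure: Jaccard overlap of the two words' posting sets."""
    if a == b:
        return 1.0
    pa, pb = index.get(a, set()), index.get(b, set())
    union = pa | pb
    return len(pa & pb) / len(union) if union else 0.0

def vector_similarity(u, v, index):
    """Average position-wise word similarity between two string vectors."""
    pairs = list(zip(u, v))
    if not pairs:
        return 0.0
    return sum(word_similarity(a, b, index) for a, b in pairs) / len(pairs)

def prototype(members, dim):
    """Assumed prototype: per position, the most common word among the members."""
    proto = []
    for pos in range(dim):
        words = [m[pos] for m in members if len(m) > pos]
        if words:
            counts = Counter(words)
            proto.append(min(counts, key=lambda w: (-counts[w], w)))
    return tuple(proto)

def string_kmeans(vectors, k, index, dim, iterations=10):
    """k-means loop with similarity in place of distance and prototypes in place of means."""
    prototypes = list(vectors[:k])  # deterministic start, chosen only for this sketch
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its most similar prototype.
        labels = [
            max(range(k), key=lambda c: vector_similarity(v, prototypes[c], index))
            for v in vectors
        ]
        # Update step: rebuild each prototype from its members.
        updated = []
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            updated.append(prototype(members, dim) if members else prototypes[c])
        if updated == prototypes:
            break
        prototypes = updated
    return labels, prototypes

docs = [
    "apple apple banana fruit",
    "stock market stock price",
    "apple banana banana fruit",
    "market price stock market",
]
index = build_inverted_index(docs)
vectors = [to_string_vector(d) for d in docs]
labels, _ = string_kmeans(vectors, k=2, index=index, dim=3)
print(labels)  # [0, 1, 0, 1]
```

In this toy run each document is represented by three words rather than by a vector with one mostly zero entry per vocabulary term, which is the dimensionality and sparsity concern the abstract raises.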

Keywords


Cited by

Locating communities on graphs with variations in community sizes vol.65, pp.2, 2013, https://doi.org/10.1007/s11227-012-0806-6
