Inverted Index based Modified Version of K-Means Algorithm for Text Clustering

  • Jo, Tae-Ho (School of Computer and Information Engineering, Inha University)
  • Published : 2008.06.30


This research proposes a new strategy for text clustering in which documents are encoded into string vectors and the k means algorithm is modified to be adaptable to string vectors. Traditionally, when the k means algorithm is used for pattern classification, raw data must be encoded into numerical vectors. This encoding may be difficult, depending on the application area. For example, in text clustering, encoding full texts given as raw data into numerical vectors leads to two main problems: huge dimensionality and sparse distribution. In this research, we encode full texts into string vectors and modify the k means algorithm so that it can operate on string vectors for text clustering.
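The abstract does not spell out the encoding or the modified update rule, so the following is only a minimal sketch of how such a scheme could look, not the author's method. It assumes that a string vector is the few most frequent words of a document, that word-to-word similarity is the Jaccard overlap of posting lists in an inverted index, and that a cluster prototype is rebuilt from the words most common among its members. All function names (`encode`, `build_inverted_index`, `string_kmeans`, and so on) are hypothetical.

```python
from collections import Counter, defaultdict

def encode(text, dim=3):
    """Encode a text as a string vector: its `dim` most frequent words (assumption)."""
    counts = Counter(text.lower().split())
    # Sort by descending frequency, then alphabetically, for determinism.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return tuple(ranked[:dim])

def build_inverted_index(texts):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(texts):
        for word in set(text.lower().split()):
            index[word].add(doc_id)
    return index

def word_similarity(a, b, index):
    """Jaccard overlap of the two words' posting lists (assumed measure)."""
    docs_a, docs_b = index.get(a, set()), index.get(b, set())
    union = docs_a | docs_b
    return len(docs_a & docs_b) / len(union) if union else 0.0

def vector_similarity(u, v, index):
    """Average word similarity over all word pairs from the two vectors."""
    if not u or not v:
        return 0.0
    total = sum(word_similarity(a, b, index) for a in u for b in v)
    return total / (len(u) * len(v))

def prototype(members, dim):
    """Cluster prototype: the `dim` words most common among member vectors."""
    counts = Counter(w for vec in members for w in vec)
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return tuple(ranked[:dim])

def string_kmeans(vectors, index, seeds, dim=3, max_iter=10):
    """k-means-style loop: assign by similarity, then rebuild prototypes."""
    prototypes = [vectors[i] for i in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            max(range(len(prototypes)),
                key=lambda c: vector_similarity(vec, prototypes[c], index))
            for vec in vectors
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(len(prototypes)):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:  # keep the old prototype if a cluster empties
                prototypes[c] = prototype(members, dim)
    return labels

# Toy corpus (illustrative data, not from the paper).
texts = [
    "stock market stock price",
    "market price stock trade",
    "soccer goal soccer match",
    "match goal soccer team",
]
index = build_inverted_index(texts)
vectors = [encode(t) for t in texts]
labels = string_kmeans(vectors, index, seeds=[0, 2])
print(labels)  # [0, 0, 1, 1]
```

Each document here is represented by three words rather than by a vector with one entry per vocabulary term, which is how a string-vector encoding can sidestep the huge dimensionality and sparse distribution mentioned above. The similarity and prototype rules are placeholders and could be swapped for whatever the paper actually defines.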


String Vector; K Means Algorithm; Text Clustering


