DOI QR코드

DOI QR Code

Fast K-Means Clustering Algorithm using Prediction Data

예측 데이터를 이용한 빠른 K-Means 알고리즘

  • 지태창 (연세대학교 컴퓨터과학과) ;
  • 이현진 (한국사이버대학교 컴퓨터정보통신학부) ;
  • 이일병 (연세대학교 컴퓨터과학과)
  • Published : 2009.01.28

Abstract

In this paper we proposed a fast method for a K-Means Clustering algorithm. The main characteristic of this method is that it uses precalculated data which possibility of change is high in order to speed up the algorithm. When calculating distance to cluster centre at each stage to assign nearest prototype in the clustering algorithm, it could reduce overall computation time by selecting only those data with possibility of change in cluster is high. Calculation time is reduced by using the distance information produced by K-Means algorithm when computing expected input data whose cluster may change, and by using such distance information the algorithm could be less affected by the number of dimensions. The proposed method was compared with original K-Means method - Lloyd's and the improved method KMHybrid. We show that our proposed method significantly outperforms in computation speed than Lloyd's and KMHybrid when using large size data which has large amount of data, great many dimensions and large number of clusters.

본 논문에서 K-Means 군집화 알고리즘을 빠르게 적용하는 방법을 제안했다. 제안하는 알고리즘의 특징은 속도 향상을 위해 변화될 가능성이 있는 데이터를 예측하는 것이다. 군집화 알고리즘의 각 단계에서 군집이 변경될 가능성이 있는 데이터만 선택하여 군집 중심과의 거리를 계산함으로써 전체 군집 계산 시간을 줄일 수 있었다. 군집이 변화될 예측 데이터를 계산할 때는 K-Means 알고리즘을 적용하면서 생성되는 거리 정보를 사용함으로써 추가되는 계산 시간이 적고, 특히, 거리 정보를 이용하기 때문에 차원의 개수에는 영향을 덜 받는 알고리즘을 제안할 수 있었다. 제안하는 알고리즘의 성능 비교를 위해서 원래의 K-Means인 Lloyd's와 이를 개선한 KMHybrid와 비교했다. 제안하는 알고리즘은 대용량 데이터( 입력 데이터의 크기가 크고, 데이터의 차원이 크며, 군집의 개수가 많은 경우)의 경우에 Lloyd's와 KMHybrid보다 높은 속도 향상을 보였다.

Keywords

References

  1. K. Alsabti, S. Ranka and V. Singh, "An Efficient K-Means Clustering Algorithm," Proc. of the first Workshop on High Performance Data Mining, 1998.
  2. M. Badoiu, S. Har-Peled, and P. Indyk, "Approximate Clustering via Coresets," Proc. 34th Annual ACM Symposium on Theory of Computing, pp.250-257, 2002.
  3. R. C. Dubes and A. K. Jain, "Algorithms for Clustering Data", Prentice Hall, 1988.
  4. M. Ester, H. Kriegel, J. Sander, and X. Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," Proc. of the 2nd International Conference on Knowledge Discovery and Data Mining, 1996.
  5. M. Ester, H. Kriegel, and X. Xu, "Knowledge Discovery in Large Spatial Databases: Focusing Techniques for Efficient Class Identification," Proc. of the Fourth International Symposium on Large Spatial Databases, 1995.
  6. W. Fernandez de la Vega, M. Karpinski, C. Kenyon, and Y. Rabani, "Approximation schemes for clustering problems," Proc. 35th Annual ACM Symposium on Theory of Computing, pp.50-58, 2003.
  7. E. Forgey, "Cluster Analysis of Multivariate Data: Efficiency vs. Interpretability of Classification," Biometrics, Vol.21, No.3, p.768, 1965.
  8. G. Frahling and C. Sohler, "A fast k-means implementation using corsets," Proc. of the twenty-second annual symposium on Computational Geometry (SoCG'06), pp.135-143, 2006.
  9. S. Har-Peled and S. Mazumdar, "Coresets for k-Means and k-Median Clustering and their Applications," Proc. 36th Annual ACM Symposium on Theory of Computing, pp.291-300, 2004.
  10. J. He, A. H. Tan, and S. Y. Sung, "On Quantitative Evaluation of Clustering Systems," Information Retrieval and Clustering, Kluwer Academic Publishers, 2003.
  11. J. Garcia, J. Fedz-Valdivia, F. Cortijo, and R. Molina, "Dynamic Approach for Clustering Data," Signal Processing, Vol.44, No.2, 1994.
  12. J. Hartigan, "Clustering Algorithms," John Wiley & Sons, New York, 1975.
  13. A. Jain and R. Dubes, "Algorithms for Clustering Data," Prentice Hall, New Jersey, 1988.
  14. D. Judd, P. McKinley, and A. Jain, "Large-Scale Parallel Data Clusrering," Proc. of the International Conference on Pattern Recognition, 1996.
  15. T. Kanungo, D. Mount, N. Netanyahu, C. Piatko, R. Silverman, and A. Wu, "A Local Search Approximation Algorithm for k-Means Clustering," Proc. of the 18th Annual Symposium on Computational Geometry (SoCG'02), pp.89-112, 2004.
  16. T. Kanungo, D. Mount, N. Netanyahu, C. Piatko, R. Silverman, and A. Wu, "An Efficient K-Means Clustering Algorithm: Analysis and Implementation," IEEE Trans. Pattern Analysis and Mchine Intelligence, Vol.24, No.7, pp.881-892, 2002. https://doi.org/10.1109/TPAMI.2002.1017616
  17. L. Kaufman and P. J. Rousseeuw, "Finding Groups in Data: and Introduction to Cluster Analysis," John Wiley & Sons, 1990.
  18. A. Kumar, Y. Sabharwal, and S. Sen, "A simple linear time ($1+{\epsilon}$)-approximation algorithm for k-means clustering in any dimensions," Proc. 45th IEEE Symposium on Foundations of Computer Science, pp.454-462, 2004. https://doi.org/10.1109/FOCS.2004.7
  19. S. Lloyd, "Least Squares Quantization in PCM", IEEE Transactions on Information Theory, Vol.28, pp.129-137, 1982. https://doi.org/10.1109/TIT.1982.1056489
  20. J. MacQueen, "Some Methods for Classification and Analysis of Multivariate Observations," Proc. of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Vol.1, pp.281-296, 1967.
  21. R. Mettu and G. Plaxton. "Optimal Time Bounds for Approximate Clustering," Machine Learning, Vol.56, No.1-3, pp.35-60, 2004. https://doi.org/10.1023/B:MACH.0000033114.18632.e0
  22. D. Mount, "KMlocal: A Testbed for k-menas Clustering Algorithms," Available at http://www.cs.umd.edu/-mount
  23. R. T. Ng and J. Han, "Efficient and Effective Clustering Methods for Spatial Data Mining," Proc. of the 20th International Conference on Very Large Databases, pp.144-155, 1994.
  24. D. Pelleg and A. Moore, "x-Means: Extending k-means with efficient estimation of the number of clusters," Proc. 17th International Conference of Machine Learning, 2000.
  25. S. Philips, "Acceleration of k-Means and Related Clustering Problems," Proc. of Algorithms Engineering and Experiments, 2002.
  26. E. Schikuta, "Grid Clustering: An Efficient Hierarchical Clustering Method for Very Large Data Sets," Proc. 13th International Conference on Pattern Recognition, 1996.
  27. T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: An Efficient Data Clustering Method for Very Large Databases," Proc. of the 1996 ACM SIGMOD International Conference. on Management of Data, pp.103-114, 1996.