Automated K-Means Clustering and R Implementation

Kim, Sung-Soo

doi:10.5351/KJAS.2009.22.4.723

The Korean Journal of Applied Statistics (응용통계연구)

Volume 22, Issue 4 / Pages 723-733 / 2009 / 1225-066X (pISSN) / 2383-5818 (eISSN)

The Korean Statistical Society (한국통계학회)

자동화 K-평균 군집방법 및 R 구현

Kim, Sung-Soo (Department of Information Statistics, Korea National Open University)

김성수 (한국방송통신대학교 정보통계학과)

Published : 2009.08.31


Abstract

The crucial problems in K-means clustering are choosing the number of clusters and the initial cluster centroids. For this reason, K-means clustering is generally carried out as a two-stage procedure. The first stage runs hierarchical clustering to obtain the number of clusters and the cluster centroids, and the second stage runs nonhierarchical K-means clustering using the results of the first stage. Here we provide an automated K-means clustering procedure for obtaining the initial cluster centroids, which can also be used for large data sets, and provide a software program implemented in R.
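The two-stage procedure described in this abstract can be sketched in base R as follows. This is a minimal illustration, not the author's program: the function name `two_stage_kmeans`, the choice of Ward linkage (`ward.D2`), the hand-supplied `k`, and the simulated data are all assumptions made for the example.

```r
# Two-stage clustering sketch (hypothetical helper, not the published code):
# stage 1 runs hierarchical clustering, stage 2 runs K-means seeded with
# the stage-1 cluster means. `k` is supplied by hand here.
two_stage_kmeans <- function(x, k) {
  x <- as.matrix(x)
  hc <- hclust(dist(x), method = "ward.D2")   # stage 1: hierarchical clustering
  groups <- cutree(hc, k = k)                 # cut the tree into k clusters
  # Initial centroids: column means of each stage-1 cluster (k x p matrix)
  init <- rowsum(x, groups) / as.vector(table(groups))
  kmeans(x, centers = init)                   # stage 2: K-means from those centroids
}

# Simulated example: two well-separated groups of 50 points each
set.seed(1)
x <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.3), ncol = 2)
)
fit <- two_stage_kmeans(x, k = 2)
table(fit$cluster)
```

Seeding `kmeans()` with a matrix of centers removes the dependence on random starting points, which is the motivation the abstract gives for the first stage.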

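The two fundamental difficulties of K-means clustering are that the number of clusters must be fixed in advance and that the result can vary with the initial cluster centers. This study proposes an automated K-means clustering procedure to address these problems and provides an implementation in R. The proposed procedure first performs hierarchical clustering and uses it to determine the number of clusters and the initial cluster centers automatically, and then runs K-means clustering using these results. For the hierarchical stage, either Ward's clustering is performed and Mojena's rule is used to choose the number of clusters, or model-based clustering is performed and the BIC value is used to choose the number of clusters. So that it can be readily applied to large data sets, the proposed procedure also includes a step that obtains the number of clusters and the cluster centers by repeated sampling. The R program is available at www.knou.ac.kr/ sskim/autokmeans.r.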
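As an illustration of the Ward-plus-Mojena route mentioned in this abstract, the sketch below picks the number of clusters from the Ward fusion heights and then seeds K-means with the resulting cluster means. It is not the program distributed by the author: the helper name `mojena_k`, the cutoff constant `c_const = 1.25`, and the simulated data are assumptions chosen for the example, and other constants appear in the literature.

```r
# Mojena-style stopping rule (hypothetical helper): stop before the first
# merge whose fusion height exceeds mean(height) + c_const * sd(height).
mojena_k <- function(hc, c_const = 1.25) {
  h <- hc$height                          # n - 1 fusion heights, in merge order
  threshold <- mean(h) + c_const * sd(h)
  j <- which(h > threshold)
  if (length(j) == 0) return(1L)          # no unusually large merge: one cluster
  n <- length(h) + 1L                     # number of observations
  as.integer(n - min(j) + 1L)             # clusters present just before merge min(j)
}

# Simulated example: three well-separated groups of 50 points each
set.seed(1)
centers <- rbind(c(0, 0), c(10, 0), c(0, 10))
x <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(50, centers[i, 1], 0.3), rnorm(50, centers[i, 2], 0.3))
}))

hc <- hclust(dist(x), method = "ward.D2") # Ward's hierarchical clustering
k <- mojena_k(hc)                         # number of clusters from the rule
groups <- cutree(hc, k = k)
init <- rowsum(x, groups) / as.vector(table(groups))  # initial centroids
fit <- kmeans(x, centers = init)          # K-means from the Ward centroids
```

For large data sets, the same two steps could be applied to random subsamples rather than to the full data, in the spirit of the repeated sampling the abstract describes; the details of that step are not given here.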

References

김성수 (1999). 통계그래픽스를 이용한 K-평균 및 계층적 군집분석, <한국분류학회지>, 3, 13-27
허명회, 이용구 (2004). K-평균 군집화의 재현성 평가 및 응용, <응용통계연구>, 17, 135-144 https://doi.org/10.5351/KJAS.2004.17.1.135
Banfield, J. D. and Raftery, A. E. (1993). Model-based Gaussian and non-Gaussian clustering, Biometrics, 49, 803-821 https://doi.org/10.2307/2532201
Brusco, M. J. and Cradit, J. D. (2001). A variable-selection heuristic for K-means clustering, Psychometrika, 66, 249-270 https://doi.org/10.1007/BF02294838
Chen, J. S., Ching, R. K. H. and Lin, Y. S. (2004). An extended study of the K-means algorithm for data clustering and its applications, The Journal of the Operational Research Society, 55, 976-987 https://doi.org/10.1057/palgrave.jors.2601732
Dasgupta, A. and Raftery, A. E. (1998). Detecting features in spatial point processes with clutter via model-based clustering, Journal of the American Statistical Association, 93, 294-302 https://doi.org/10.2307/2669625
Everitt, B. S., Landau, S. and Leese, M. (2001). Cluster Analysis, Arnold, London
Fraley, C. (1998). Algorithms for model-based Gaussian hierarchical clustering, SIAM Journal on Scientific Computing, 20, 270-281 https://doi.org/10.1137/S1064827596311451
Fraley, C. and Raftery, A. E. (1998). How many clusters? Which clustering methods? Answers via model-based cluster analysis, The Computer Journal, 41, 578-588 https://doi.org/10.1093/comjnl/41.8.578
Fraley, C. and Raftery, A. E. (2006). MCLUST Version 3 for R: Normal Mixture Modeling and Model-Based Clustering, Technical Report No. 504, Department of Statistics, University of Washington
Hartigan, J. A. and Wong, M. A. (1979). A K-means clustering algorithm, Applied Statistics, 28, 100-108 https://doi.org/10.2307/2346830
Kim, S. S., Kwon, S. and Cook, D. (2000). Interactive visualization of hierarchical clusters using MDS and MST, Metrika, 51, 39-51 https://doi.org/10.1007/s001840000043
Krzanowski, W. J. (1988). Principles of Multivariate Analysis, Oxford Science, Oxford
Milligan, G. and Cooper, M. C. (1985). An examination of procedures for determining the number of clusters in a data set, Psychometrika, 50, 159-179 https://doi.org/10.1007/BF02294245
Mojena, R. (1977). Hierarchical grouping methods and stopping rules: An evaluation, The Computer Journal, 20, 359-363 https://doi.org/10.1093/comjnl/20.4.359
Mojena, R., Wishart, D. and Andrews, G. B. (1980). Stopping rules for Ward's clustering method, COMPSTAT, 426-432
Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods, Journal of the American Statistical Association, 66, 846-850 https://doi.org/10.2307/2284239
SPSS (2000). Clementine Application Templates for Telecommunication Industries (Telco CAT), Chicago, SPSS Inc.
Stanford, D. C. and Raftery, A. E. (2000). Principal curve clustering with noise, IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 601-609 https://doi.org/10.1109/34.862198
Ward, J. H. (1963). Hierarchical grouping to optimize an objective function, Journal of the American Statistical Association, 58, 236-244 https://doi.org/10.2307/2282967
Wehrens, R., Buydens, L. M. C., Fraley, C. and Raftery, A. E. (2004). Model-based clustering for image segmentation and large data sets via sampling, Journal of Classification, 21, 231-253 https://doi.org/10.1007/s00357-004-0018-8

Cited by

A Variable Selection Procedure for K-Means Clustering vol.25, pp.3, 2012, https://doi.org/10.5351/KJAS.2012.25.3.471
Variable Selection and Outlier Detection for Automated K-means Clustering vol.22, pp.1, 2015, https://doi.org/10.5351/CSAM.2015.22.1.055
