Reproducibility Assessment of K-Means Clustering and Applications

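  • Huh, Myung-Hoe (허명회), Department of Statistics, Korea University
  • Lee, Yong-Goo (이용구), Department of Applied Statistics, Chung-Ang University
  • Published: 2004.03.01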

Abstract

We propose a procedure for assessing the reproducibility (validity) of K-means cluster analysis. The data set is randomly partitioned into three parts: two subsets are used to develop clustering rules, and the third is used to test the consistency of those rules. As an alternative to the Rand index and the corrected Rand index, we also propose an entropy-based measure of consistency between two clustering rules and apply it to determining the number of clusters in K-means clustering.

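K-means clustering is an unsupervised learning method that plays an important role in data mining tasks such as customer segmentation. To check whether K-means clustering is reproducible, many existing studies use a data partitioning approach that splits the observed data into two sets. In this communication we propose a new, conceptually clearer data partitioning method. It splits the observed data into three sets, uses two of them to generate independent clustering rules, and uses the remaining set to test the consistency between the rules. We also present a way to use an entropy-based criterion as a measure of consistency between two clustering rules.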
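The three-way partitioning idea can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not define the entropy-based measure, so `entropy_consistency` uses a normalized mutual information as a stand-in, and the function names, the plain Lloyd's-algorithm `kmeans`, and the equal-thirds split are all hypothetical choices made for the example.

```python
import math
import random
from collections import Counter


def nearest(p, centroids):
    """Index of the centroid closest to p (the 'clustering rule')."""
    return min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))


def kmeans(points, k, rng, iters=50):
    """Plain Lloyd's algorithm on tuples of floats; returns k centroids."""
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Keep the old centroid if a group happens to be empty.
        updated = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def rand_index(a, b):
    """Rand index: share of point pairs on which two labelings agree."""
    n = len(a)
    agree = sum(
        (a[i] == a[j]) == (b[i] == b[j])
        for i in range(n)
        for j in range(i + 1, n)
    )
    return agree / (n * (n - 1) / 2)


def entropy(counts, n):
    """Shannon entropy (natural log) of a table of counts summing to n."""
    return -sum(c / n * math.log(c / n) for c in counts if c)


def entropy_consistency(a, b):
    """ASSUMPTION: normalized mutual information, used here only as one
    possible entropy-based measure; the paper's definition may differ."""
    n = len(a)
    h_a = entropy(Counter(a).values(), n)
    h_b = entropy(Counter(b).values(), n)
    h_ab = entropy(Counter(zip(a, b)).values(), n)
    if h_a + h_b == 0:
        return 1.0  # both labelings put every point in one cluster
    return 2 * (h_a + h_b - h_ab) / (h_a + h_b)


def reproducibility(points, k, seed=0):
    """Split into three parts: fit rules on two, compare them on the third."""
    rng = random.Random(seed)
    shuffled = points[:]
    rng.shuffle(shuffled)
    third = len(shuffled) // 3
    train_a = shuffled[:third]
    train_b = shuffled[third:2 * third]
    test = shuffled[2 * third:]
    rule_a = kmeans(train_a, k, rng)
    rule_b = kmeans(train_b, k, rng)
    labels_a = [nearest(p, rule_a) for p in test]
    labels_b = [nearest(p, rule_b) for p in test]
    return rand_index(labels_a, labels_b), entropy_consistency(labels_a, labels_b)
```

Calling `reproducibility(data, k)` for several candidate values of `k` and comparing the resulting consistency scores is one way to explore the number of clusters, in the spirit of the application the abstract describes. Both measures are invariant to how the two rules happen to number their clusters, which matters because the two rules are fitted independently.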

References

  1. Chae, S.S. (채성산). Predicting the number of clusters through resampling and testing. Natural Science (Journal of the Institute of Basic Science, Daejeon University), v.8 no.1.
  2. Chae, S.S. An asymptotic result concerning a comparative statistic in cluster analysis. Natural Science, v.6.
  3. Chae, S.S.; Warde, W.D. A method to predict the number of clusters. Journal of Korean Statistical Society, v.20.
  4. Gordon, A.D. Classification (2nd Ed.).
  5. Hubert, L.; Arabie, P. Comparing partitions. Journal of Classification, v.2. https://doi.org/10.1007/BF01908075
  6. McIntyre, R.M.; Blashfield, R.K. A nearest-centroid technique for evaluating the minimum-variance clustering procedure. Multivariate Behavioral Research, v.15. https://doi.org/10.1207/s15327906mbr1502_7
  7. Milligan, G.W. Clustering validation: Results and implications for applied analyses. In P. Arabie et al. (Eds.), Clustering and Classification.
  8. Morey, L.C.; Blashfield, R.K.; Skinner, H.A. A comparison of cluster analysis techniques within a sequential validation framework. Multivariate Behavioral Research, v.18. https://doi.org/10.1207/s15327906mbr1803_4
  9. Quinlan, J.R. C4.5: Programs for Machine Learning.
  10. Rand, W.M. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, v.66. https://doi.org/10.2307/2284239
  11. SPSS. Clementine Application Templates for Telecommunication Industries (Telco CAT).

Cited by

  1. A method of predicting the number of clusters using Rand's statistic vol.50, pp.12, 2006, https://doi.org/10.1016/j.csda.2005.08.006
  2. A Study on Fault Prediction Method in a Pump Tower of LNG FPSO vol.21, pp.2, 2016, https://doi.org/10.7315/CADCAM.2016.111
  3. A Comparison on the Forest Type of Coastal Disaster Prevention Forest Between the Coastal Areas in Korea vol.103, pp.4, 2014, https://doi.org/10.14578/jkfs.2014.103.4.564
  4. Collaborative Filtering Design Using Genre Similarity and Preffered Genre vol.16, pp.4, 2011, https://doi.org/10.9708/jksci.2011.16.4.159
  5. Ryodoraku pattern classifications of tinnitus patients using cluster analysis vol.26, pp.4, 2013, https://doi.org/10.6114/jkood.2013.26.4.051
  6. Recommender system design using movie genre similarity and preferred genres in SmartPhone vol.61, pp.1, 2012, https://doi.org/10.1007/s11042-011-0728-y
  7. Subspace Projection Method Based Clustering Analysis in Load Profiling vol.29, pp.6, 2014, https://doi.org/10.1109/TPWRS.2014.2309697