Variable Selection and Outlier Detection for Automated K-means Clustering

Kim, Sung-Soo

doi:10.5351/CSAM.2015.22.1.055

Communications for Statistical Applications and Methods

Volume 22, Issue 1, Pages 55-67, 2015, 2287-7843 (pISSN), 2383-4757 (eISSN)

The Korean Statistical Society (한국통계학회)

Kim, Sung-Soo (Department of Information Statistics, Korea National Open University)

Received : 2014.10.28
Accepted : 2015.01.13
Published : 2015.01.31

https://doi.org/10.5351/CSAM.2015.22.1.055

Abstract

An important problem in cluster analysis is selecting the variables that define cluster structure while eliminating the noisy variables that mask it; outlier detection is likewise a fundamental task in cluster analysis. Here we provide an automated K-means clustering process combined with variable selection and outlier identification. The automated K-means clustering procedure consists of three processes: (i) automatically calculating the number of clusters and the initial cluster centers whenever a new variable is added, (ii) identifying outliers in each cluster based on the variables currently in use, and (iii) selecting the variables that define cluster structure in a forward manner. To select variables, we applied the VS-KM (variable-selection heuristic for K-means clustering) procedure (Brusco and Cradit, 2001). To identify outliers, we used a hybrid approach that combines a clustering-based approach with a distance-based approach. Simulation results indicate that the proposed automated K-means clustering procedure is effective at selecting variables and identifying outliers. The implemented R program can be obtained at http://www.knou.ac.kr/~sskim/SVOKmeans.r.
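
As a rough illustration of the hybrid outlier step mentioned in the abstract (cluster first, then apply a distance rule inside each cluster), the following Python sketch may help. It is not the author's R implementation: the abstract does not state the distance rule, so the median-plus-MAD cutoff, the function names `kmeans` and `flag_outliers`, and the `cutoff` parameter are all assumptions made for this example. The initial centers are supplied by hand here, whereas the paper computes them automatically.

```python
import math
import statistics


def kmeans(points, centers, n_iter=50):
    """Plain Lloyd's K-means started from the given centers (assumed helper)."""
    centers = [tuple(c) for c in centers]

    def assign(ctrs):
        # Assignment step: index of the nearest center (Euclidean distance).
        return [min(range(len(ctrs)), key=lambda k: math.dist(p, ctrs[k]))
                for p in points]

    for _ in range(n_iter):
        labels = assign(centers)
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                # Update step: move the center to the mean of its members.
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[k])  # leave an empty cluster's center alone
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(centers)


def flag_outliers(points, centers, labels, cutoff=3.0):
    """Flag points that lie unusually far from their own cluster center.

    Assumed rule (not taken from the paper): within each cluster, flag a
    point when its distance to the center exceeds median + cutoff * MAD
    of that cluster's distances.
    """
    flags = [False] * len(points)
    for k in range(len(centers)):
        idx = [i for i, lab in enumerate(labels) if lab == k]
        if len(idx) < 3:
            continue  # too few points to estimate a spread
        dists = [math.dist(points[i], centers[k]) for i in idx]
        med = statistics.median(dists)
        mad = statistics.median([abs(d - med) for d in dists])
        if mad < 1e-12:
            continue  # degenerate spread; skip rather than flag everything
        for i, d in zip(idx, dists):
            flags[i] = d > med + cutoff * mad
    return flags


# Two tight groups plus one stray point that K-means absorbs into the first group.
points = [(0, 0), (0.1, 0), (0, 0.1), (-0.1, 0), (0, -0.1), (0.1, 0.1),
          (2.0, -1.5),
          (5, 5), (5.2, 5.1), (4.9, 5.3), (5.1, 4.8), (4.8, 4.9), (5.3, 5.2)]
centers, labels = kmeans(points, [(0, 0), (5, 5)])
flags = flag_outliers(points, centers, labels)
print([i for i, f in enumerate(flags) if f])
```

A median/MAD cutoff is used here only because it is less distorted by the outlier itself than a mean and standard deviation would be; a robust Mahalanobis distance, as discussed in several of the references, would be another reasonable choice.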

References

Arai, K. and Barakbah, A. R. (2007). Hierarchical K-means: an algorithm for centroids initialization for K-means, Reports of the Faculty of Science and Engineering, Saga University, 36, 25-31.
Banfield, J. D. and Raftery, A. E. (1993). Model-based Gaussian and non-Gaussian clustering, Biometrics, 49, 803-821. https://doi.org/10.2307/2532201
Bartkowiak, A. (2005). Robust Mahalanobis distances obtained using the 'Multout' and 'Fast-mcd' methods, Biocybernetics and Biomedical Engineering, 25, 7-21.
Brusco, M. J. and Cradit, J. D. (2001). A variable-selection heuristic for K-means clustering, Psychometrika, 66, 249-270. https://doi.org/10.1007/BF02294838
Carmone, F. J., Kara, A. and Maxwell, S. (1999). HINoV: A new model to improve market segmentation by identifying noisy variables, Journal of Marketing Research, 36, 501-509. https://doi.org/10.2307/3152003
Everitt, B. S., Landau, S. and Leese, M. (2001). Cluster Analysis, Arnold.
Filzmoser, P. and Varmuza, K. (2013). Package Chemometrics. Documentation available at: http://cran.r-project.org/web/packages/chemometrics/index.html.
Fowlkes, E. B., Gnanadesikan, R. and Kettenring, J. R. (1988). Variable selection in clustering, Journal of Classification, 5, 205-228. https://doi.org/10.1007/BF01897164
Fowlkes, E. B. and Mallows, C. L. (1983). A method for comparing two hierarchical clusterings (with comments and rejoinder), Journal of the American Statistical Association, 78, 553-584. https://doi.org/10.1080/01621459.1983.10478008
Fraley, C. and Raftery, A. E. (1998). How many clusters? Which clustering methods? Answers via model-based cluster analysis, Computer Journal, 41, 578-588. https://doi.org/10.1093/comjnl/41.8.578
Gnanadesikan, R., Kettenring, J. R. and Tsao, S. L. (1995). Weighting and selection of variables for cluster analysis, Journal of Classification, 7, 271-285.
Hautamaki, V., Cherednichenko, S., Karkkainen, I., Kinnunen, T. and Franti, P. (2005). Improving K-means by outlier removal, LNCS, Springer, Berlin/Heidelberg, May 2005, 978-987.
Hawkins, D. (1980). Identification of Outliers, Chapman and Hall, London.
Hubert, L. and Arabie, P. (1985). Comparing partitions, Journal of Classification, 2, 193-218.
Jayakumar, G. S. and Thomas, B. J. (2013). A new procedure of clustering based on multivariate outlier detection, Journal of Data Science, 11, 69-84.
Jiang, M. F., Tseng, S. S. and Su, C. M. (2001). Two-phase clustering process for outliers detection, Pattern Recognition Letters, 22, 691-700. https://doi.org/10.1016/S0167-8655(00)00131-8
Kim, S. (2009). Automated K-means clustering and R implementation, The Korean Journal of Applied Statistics, 22, 723-733. https://doi.org/10.5351/KJAS.2009.22.4.723
Kim, S. (2012). A variable selection procedure for K-means clustering, The Korean Journal of Applied Statistics, 25, 471-483. https://doi.org/10.5351/KJAS.2012.25.3.471
Kriegel, H.-P., Kroger, P. and Zimek, A. (2010). Outlier detection techniques, The 2010 SIAM International Conference on Data Mining, Available from: https://www.siam.org/meetings/sdm10/tutorial3.pdf.
Milligan, G. W. (1980). An examination of six types of the effects of error perturbation on fifteen clustering algorithms, Psychometrika, 45, 325-342. https://doi.org/10.1007/BF02293907
Milligan, G. W. (1985). An algorithm for generating artificial test clusters, Psychometrika, 50, 123-127. https://doi.org/10.1007/BF02294153
Milligan, G. W. (1989). A validation study of a variable-weighting algorithm for cluster analysis, Journal of Classification, 6, 53-71. https://doi.org/10.1007/BF01908588
Milligan, G. and Cooper, M. C. (1985). An examination of procedures for determining the number of clusters in a data set, Psychometrika, 50, 159-179. https://doi.org/10.1007/BF02294245
Mojena, R. (1977). Hierarchical grouping method and stopping rules: An evaluation, The Computer Journal, 20, 359-363. https://doi.org/10.1093/comjnl/20.4.359
Mojena, R., Wishart, D. and Andrews, G. B. (1980). Stopping rules for Ward's clustering method, COMPSTAT, 426-432.
Pachgade, S. D. and Dhande, S. S. (2012). Outlier detection over data set using cluster-based and distance-based approach, International Journal of Advanced Research in Computer Science and Software Engineering, 2, 12-16.
Pamula, R., Deka, J. K. and Nandi, S. (2011). An outlier detection method based on clustering, Second International Conference on Emerging Applications of Information Technology, 253-256.
Qiu, W.-L. and Joe, H. (2006a). Generation of random clusters with specified degree of separation, Journal of Classification, 23, 315-334. https://doi.org/10.1007/s00357-006-0018-y
Qiu, W.-L. and Joe, H. (2006b). Separation index and partial membership for clustering, Computational Statistics and Data Analysis, 50, 585-603. https://doi.org/10.1016/j.csda.2004.09.009
Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods, Journal of the American Statistical Association, 66, 846-850. https://doi.org/10.1080/01621459.1971.10482356
Rocke, D. M. and Woodruff, D. L. (1996). Identification of outliers in multivariate data, Journal of the American Statistical Association, 91, 1047-1061. https://doi.org/10.1080/01621459.1996.10476975
Rousseeuw, P. J. and Leroy, A. M. (1987). Robust Regression and Outlier Detection, John Wiley and Sons, New York.
Rousseeuw, P. J. and van Zomeren, B. C. (1990). Unmasking multivariate outliers and leverage points, Journal of the American Statistical Association, 85, 633-651. https://doi.org/10.1080/01621459.1990.10474920
Tibshirani, R., Walther, G. and Hastie, T. (2001). Estimating the Number of Clusters in a Dataset via the Gap Statistic, Technical report, Dept of Biostatistics, Stanford University, Available from: http://www-stat.stanford.edu/~tibs/research.html.
Ward, J. H. (1963). Hierarchical grouping to optimize an objective function, Journal of the American Statistical Association, 58, 236-244. https://doi.org/10.1080/01621459.1963.10500845
Wehrens, R., Buydens, L., Fraley, C. and Raftery, A. (2004). Model-based clustering for image segmentation and large datasets via sampling, Journal of Classification, 21, 231-253. https://doi.org/10.1007/s00357-004-0018-8

Cited by

k-means clustering with outlier removal, vol. 90, 2017, https://doi.org/10.1016/j.patrec.2017.03.008
Joint selection of variables and clusters: recovering the underlying structure of marketing data, pp. 2050-3326, 2019, https://doi.org/10.1057/s41270-018-0045-7
