Efficient Data Clustering using Fast Choice for Number of Clusters

Kim, Sung-Soo; Kang, Bum-Su

doi:10.11627/jkise.2018.41.2.001

Journal of Korean Society of Industrial and Systems Engineering (산업경영시스템학회지)

Volume 41, Issue 2 / Pages 1-8 / 2018 / 2005-0461 (pISSN) / 2287-7975 (eISSN)

Society of Korea Industrial and System Engineering (한국산업경영시스템학회)

Kim, Sung-Soo (Department of Industrial Engineering, Kangwon National University) ;
Kang, Bum-Su (Department of Industrial Engineering, Kangwon National University)

Received : 2018.02.21
Accepted : 2018.04.26
Published : 2018.06.30

Abstract

The K-means algorithm is one of the most popular and widely used clustering methods because it is easy to implement and very efficient. However, it can only be used with a fixed number of clusters, because it evaluates clustering solutions using the intra-cluster distance alone. The silhouette is a useful and stable validity index for choosing a clustering solution together with its number of clusters, since it considers both the intra-cluster and inter-cluster distances for unsupervised data. However, this index carries a high computational burden because it computes a quality measure for every data object. The objective of this paper is to propose a fast and simple speed-up method that overcomes this limitation, so that the silhouette can be used for effective large-scale data clustering. In the first step, the proposed method calculates and stores the distances between data objects once. In the second step, this distance matrix is used to calculate the relative distance rate ($V_j$) of each data object $j$, and this rate is used to choose a suitable number of clusters without much computation time. In the third step, the proposed efficient heuristic algorithm (group search optimization, GSO, in this paper) searches for the global optimum while saving computation, starting from good initial solutions generated probabilistically using $V_j$. Experiments and analysis on the Ruspini dataset and on the Iris, Wine, and Breast Cancer datasets from the UCI Machine Learning Repository show that the proposed method saves significant computation time compared with using the original silhouette alone. In particular, the proposed method performs much better than the previous method on larger datasets.
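
To make the first step and the cost of the silhouette concrete, the following is a minimal Python sketch, not the authors' implementation. It computes the pairwise distance matrix once and then evaluates the mean silhouette width, following the standard definition of Rousseeuw (1987), for any candidate labeling by reusing that matrix. The function names `distance_matrix` and `silhouette` and the four sample points are illustrative assumptions. The relative distance rate $V_j$ is not defined in the abstract, so it is deliberately left out.

```python
import math


def distance_matrix(points):
    """Compute all pairwise Euclidean distances once (O(n^2) work and storage)."""
    n = len(points)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])
            dist[i][j] = dist[j][i] = d
    return dist


def silhouette(dist, labels):
    """Mean silhouette width of a labeling, read from a precomputed matrix."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # s(i) = 0 for a singleton cluster, by convention
        # a: mean distance to the other members of i's own cluster
        a = sum(dist[i][j] for j in own) / (len(own) - 1)
        # b: smallest mean distance from i to any other cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(labels)


# Hypothetical toy data: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
dist = distance_matrix(points)          # computed once
good = silhouette(dist, [0, 0, 1, 1])   # natural grouping
bad = silhouette(dist, [0, 1, 0, 1])    # grouping that splits each pair
```

Because every call to `silhouette` only reads from `dist`, comparing many candidate labelings or cluster counts does not repeat the distance computation. Each evaluation still touches every data object, which is the per-object cost the abstract describes.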
