Hybrid Simulated Annealing for Data Clustering

  • Kim, Sung-Soo (Department of System & Management Engineering, Kangwon National University) ;
  • Baek, Jun-Young (Department of System & Management Engineering, Kangwon National University) ;
  • Kang, Beom-Soo (Department of System & Management Engineering, Kangwon National University)
  • Received : 2017.04.24
  • Reviewed : 2017.06.19
  • Published : 2017.06.30

Abstract

Data clustering groups the patterns in a dataset according to a similarity measure and is one of the most important and difficult techniques in data mining. Clustering can be formally regarded as a particular kind of NP-hard grouping problem. The K-means algorithm, although popular and efficient, is sensitive to initialization and can become stuck in a local optimum because it is a hill-climbing clustering method. It is also not computationally feasible in practice, especially for large datasets and large numbers of clusters. A robust and efficient clustering algorithm that finds the global optimum rather than a local one is therefore needed, especially now that large amounts of data are collected from many IoT (Internet of Things) devices. This paper proposes a new Hybrid Simulated Annealing (HSA) method that combines simulated annealing with K-means for non-hierarchical clustering of big data. Simulated annealing (SA) is useful for diversified search in a large search space, and K-means is useful for converged search in a predetermined search space. The proposed method balances intensification and diversification to find the global optimal solution in big data clustering. The performance of HSA is validated by experiment and analysis on the Iris, Wine, Glass, and Vowel datasets from the UCI machine learning repository and compared with previous studies. In our simulations, the proposed KSAK (K-means+SA+K-means) and SAK (SA+K-means) variants perform better than KSA (K-means+SA), SA, and K-means. The method significantly improves the accuracy and efficiency of finding the global optimal data clustering solution for complex, real-time, and costly data mining processes.
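To make the SA+K-means ordering concrete, the following is a minimal sketch of an SAK-style pipeline: simulated annealing first explores centroid positions, and K-means then refines the best centroids found. The abstract does not specify the neighborhood move, cooling schedule, or parameter values, so the Gaussian perturbation of one centroid, the geometric cooling, and every default (`t0`, `t_min`, `cooling`, `moves_per_temp`, `step`) are assumptions chosen for illustration, as are all function names. It should not be read as the authors' implementation.

```python
import math
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points]


def sse(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


def kmeans(points, centroids, max_iter=100):
    """Plain Lloyd iterations starting from the given centroids."""
    centroids = [list(c) for c in centroids]
    for _ in range(max_iter):
        labels = assign(points, centroids)
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids


def anneal(points, k, rng, t0=1.0, t_min=1e-3, cooling=0.9,
           moves_per_temp=20, step=0.5):
    """Simulated annealing over centroid positions (assumed move and schedule)."""
    current = [list(p) for p in rng.sample(points, k)]
    current_cost = sse(points, current)
    best, best_cost = current, current_cost
    t = t0
    while t > t_min:
        for _ in range(moves_per_temp):
            # Assumed neighborhood move: jitter one randomly chosen centroid.
            candidate = [list(c) for c in current]
            j = rng.randrange(k)
            candidate[j] = [x + rng.gauss(0.0, step) for x in candidate[j]]
            cost = sse(points, candidate)
            delta = cost - current_cost
            # Metropolis rule: always accept improvements, sometimes accept worse moves.
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                current, current_cost = candidate, cost
                if cost < best_cost:
                    best, best_cost = candidate, cost
        t *= cooling  # geometric cooling (assumption)
    return best


def sak(points, k, seed=0):
    """SA for diversified search, then K-means for converged search."""
    rng = random.Random(seed)
    return kmeans(points, anneal(points, k, rng))
```

Under the same assumptions, a KSAK-style variant would run `kmeans` once before annealing and start the annealing from those centroids instead of a random sample. Because Lloyd iterations never increase the SSE, the final K-means pass can only keep or improve the solution that the annealing stage returns.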
