Comparison between k-means and k-medoids Algorithms for a Group-Feature based Sliding Window Clustering

그룹특징기반 슬라이딩 윈도우 클러스터링에서의 k-means와 k-medoids 비교 평가

  • Yang, Ju-Yon (Dept. of Computer Science, Sookmyung Women's University)
  • Shim, Junho (Dept. of Computer Science, Sookmyung Women's University)
  • Received: 2018.08.14
  • Accepted: 2018.08.24
  • Published: 2018.08.31

Abstract

The demand for processing large data streams is growing rapidly as generating and processing large volumes of data becomes commonplace. A variety of large-scale data processing technologies are being developed to meet this demand. One that has drawn particular attention from researchers is data stream clustering with sliding windows, which may create a new set of clusters whenever the window moves. Previous sliding-window clustering techniques exploit coresets, also known as group features, that summarize the data. In this paper, we identify elements of a group-feature based algorithm that can be improved, and we propose a modified algorithm that changes the clustering algorithm used within the original one. We compare the performance of the two algorithms under different parameter values. Finally, we provide guidelines for choosing between the two algorithms according to the parameter values and their impact on performance.
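The sliding-window idea in the abstract can be illustrated with a minimal Python sketch. It assumes a BIRCH-style summary (count, per-dimension linear sum, sum of squared norms) as the group feature. That layout, and the names `GroupFeature` and `SlidingWindow`, are hypothetical and are not taken from the paper.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Sequence, Tuple


@dataclass
class GroupFeature:
    """Hypothetical BIRCH-style summary of one batch of points (an assumption)."""
    n: int                   # number of points summarized
    linear_sum: List[float]  # per-dimension sum of coordinates
    square_sum: float        # sum of squared coordinates over all points

    @classmethod
    def from_points(cls, points: Sequence[Sequence[float]]) -> "GroupFeature":
        dim = len(points[0])
        linear = [sum(p[d] for p in points) for d in range(dim)]
        square = sum(x * x for p in points for x in p)
        return cls(len(points), linear, square)

    def centroid(self) -> List[float]:
        return [s / self.n for s in self.linear_sum]


class SlidingWindow:
    """Keeps summaries of only the most recent `max_batches` batches."""

    def __init__(self, max_batches: int) -> None:
        # A bounded deque drops the oldest summary when a new one arrives.
        self.features: Deque[GroupFeature] = deque(maxlen=max_batches)

    def push(self, batch: Sequence[Sequence[float]]) -> None:
        self.features.append(GroupFeature.from_points(batch))

    def representatives(self) -> List[Tuple[List[float], int]]:
        # (centroid, weight) pairs that a clustering step could consume
        # instead of the raw points.
        return [(gf.centroid(), gf.n) for gf in self.features]
```

Each time the window moves, a clustering step (k-means or k-medoids) would be rerun on the weighted representatives rather than on the raw stream, which is what makes the summaries useful.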

Figures

  • Figure 1. Sliding Window Process
  • Figure 2. Clustering based on Group Features
  • Figure 3. Process for K-Means and K-Medoids
  • Figure 4. Performance Comparison Procedure
  • Figure 5. How to Process TSV Files as K-Means
  • Figure 6. How to Process TSV Files as K-Medoids
  • Figure 7. SSQ of the Dataset 'knews.tsv' (Comparative group: K)
  • Figure 8. SSQ of the Dataset 'covtype.tsv' (Comparative group: K)
  • Figure 9. SSQ of the Dataset 'syn1k30d40.tsv' (Comparative group: K)
  • Figure 10. Process Time of the Dataset 'knews.tsv' (Comparative group: K)
  • Figure 11. Process Time of the Dataset 'covtype.tsv' (Comparative group: K)
  • Figure 12. Process Time of the Dataset 'syn1k30d40.tsv' (Comparative group: K)
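The figures above report SSQ for both algorithms. As a rough illustration of what is being compared, the sketch below contrasts a k-means style center (the mean) with a k-medoids style center (an actual data point) and scores each by SSQ. It assumes SSQ means the sum of squared Euclidean distances from each point to its nearest center; the function names are hypothetical and do not come from the paper or from JSAT.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, ...]


def mean_center(cluster: Sequence[Point]) -> Point:
    """k-means style center: the coordinate-wise mean (need not be a data point)."""
    dim = len(cluster[0])
    return tuple(sum(p[d] for p in cluster) / len(cluster) for d in range(dim))


def medoid(cluster: Sequence[Point]) -> Point:
    """k-medoids style center: the member with the smallest total distance to the others."""
    return min(cluster, key=lambda c: sum(math.dist(c, p) for p in cluster))


def ssq(points: Sequence[Point], centers: Sequence[Point]) -> float:
    """Sum of squared distances from each point to its nearest center (assumed definition)."""
    return sum(min(math.dist(p, c) ** 2 for c in centers) for p in points)
```

For a fixed cluster, the mean always yields an SSQ no higher than the medoid's, because the mean is the minimizer of squared distance. The medoid, on the other hand, is less affected by outliers: for the points (0, 0), (1, 0), (2, 0), (3, 0), (50, 0), the medoid is (2, 0) while the mean is pulled toward the outlier.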
