Parallel k-Modes Algorithm for Spark Framework

Chung, Jaehwa;

doi:10.3745/KTSDE.2017.10.487

KIPS Transactions on Software and Data Engineering (정보처리학회논문지:소프트웨어 및 데이터공학)

Volume 6, Issue 10 / Pages 487-492 / 2017 / 2287-5905 (pISSN) / 2734-0503 (eISSN)

Korea Information Processing Society (한국정보처리학회)


스파크 프레임워크를 위한 병렬적 k-Modes 알고리즘

Chung, Jaehwa (Department of Computer Science, Korea National Open University)

Received : 2017.07.11
Accepted : 2017.07.25
Published : 2017.10.31

https://doi.org/10.3745/KTSDE.2017.10.487

Abstract

Clustering is a technique for identifying similarities among data in big data analysis and data mining. Among the various clustering methods, the k-Modes algorithm is the representative choice for categorical data. Spark, a distributed and concurrent framework, has recently received great attention as a way to speed up iteration-heavy tasks such as k-Modes because it overcomes a limitation of Hadoop: using an abstraction called the RDD, it provides an environment in which large amounts of data can be processed in main memory. Spark provides MLlib, a dedicated machine learning library, but MLlib includes only k-means, which handles only continuous data, so categorical data cannot be clustered with it. In this paper, we design an RDD for the k-Modes algorithm for clustering categorical data in the Spark environment and implement an algorithm that operates effectively on it. Experiments show that the proposed algorithm scales linearly in the Spark environment.
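As a rough illustration of how k-Modes lends itself to a partition-parallel formulation, the following is a minimal sketch in plain Python. It is not the paper's RDD design or implementation, and it does not use the Spark API. All function names (`hamming`, `assign`, `map_partition`, `merge`, `k_modes`) are hypothetical, and the sketch assumes the usual simple-matching dissimilarity and per-attribute mode update of k-Modes.

```python
from collections import Counter
from functools import reduce


def hamming(a, b):
    """Simple-matching dissimilarity: number of attributes that differ."""
    return sum(x != y for x, y in zip(a, b))


def assign(record, modes):
    """Index of the nearest mode (ties go to the lowest index)."""
    return min(range(len(modes)), key=lambda j: hamming(record, modes[j]))


def map_partition(partition, modes):
    """Map step: per-cluster, per-attribute value counts for one partition."""
    n_attrs = len(modes[0])
    counts = [[Counter() for _ in range(n_attrs)] for _ in modes]
    for record in partition:
        j = assign(record, modes)
        for a, value in enumerate(record):
            counts[j][a][value] += 1
    return counts


def merge(c1, c2):
    """Reduce step: add the count tables produced by two partitions."""
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(c1, c2)]


def k_modes(partitions, modes, max_iter=20):
    """Iterate map/reduce rounds until the modes stop changing.

    `partitions` is a list of lists of tuples (a stand-in for a
    partitioned dataset); `modes` is a list of tuples (initial modes).
    """
    for _ in range(max_iter):
        totals = reduce(merge, (map_partition(p, modes) for p in partitions))
        # New mode = most frequent value per attribute; an empty cluster
        # keeps its previous mode (an assumption of this sketch).
        new_modes = [
            tuple(c.most_common(1)[0][0] if c else old[a]
                  for a, c in enumerate(row))
            for row, old in zip(totals, modes)
        ]
        if new_modes == modes:
            break
        modes = new_modes
    return modes


partitions = [
    [("a", "x", "p"), ("a", "x", "q"), ("b", "y", "r")],
    [("b", "y", "r"), ("a", "x", "p"), ("b", "z", "r")],
]
result = k_modes(partitions, [("a", "x", "q"), ("b", "z", "r")])
```

In a Spark setting, one would expect `map_partition` to correspond roughly to an `RDD.mapPartitions` call and `merge` to an `RDD.reduce` call, with the current modes shared with the workers on each iteration. That mapping is an assumption of this sketch, not a description of the published algorithm.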


