Matrix-based Filtering and Load-balancing Algorithm for Efficient Similarity Join Query Processing in Distributed Computing Environment

Yang, Hyeon-Sik; Jang, Miyoung; Chang, Jae-Woo

doi:10.5392/JKCA.2016.16.07.667

The Journal of the Korea Contents Association

Vol. 16, No. 7 / pp. 667-680 / 2016 / pISSN 1598-4877 / eISSN 2508-6723

The Korea Contents Association



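Yang, Hyeon-Sik (Department of IT Information Engineering, Jeonbuk National University);
Jang, Miyoung (Department of Computer Engineering, Jeonbuk National University);
Chang, Jae-Woo (Department of IT Information Engineering, Jeonbuk National University)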

Received: 2016.03.04
Reviewed: 2016.04.19
Published: 2016.07.28


Abstract

With the development of distributed computing platforms such as Hadoop MapReduce, query processing techniques that were designed for a single machine need to run efficiently in distributed computing environments. In particular, there have been studies on processing similarity join queries, which retrieve all highly similar data pairs from two given data sets, in distributed computing environments. However, existing similarity join schemes for distributed parallel environments consider only the data transmission cost, so the computing load is distributed unevenly across clusters. In this paper, we propose a matrix-based load-balancing algorithm for efficient similarity join query processing in distributed computing environments. To balance the load evenly across clusters, the proposed algorithm uses a matrix to estimate the expected computing cost and generates partitions based on that estimate. In addition, it reduces the computing load by filtering out data that is not used in query processing in the clusters. Finally, our performance evaluation shows that the proposed algorithm outperforms the existing scheme in query processing performance.
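The abstract does not spell out the algorithm's details. Purely as an illustration, the three ideas it names (estimating cost with a matrix, filtering data that cannot contribute to the result, and generating balanced partitions) might be sketched as below. Everything in this sketch is an assumption rather than the authors' method: one-dimensional numeric data, a similarity test of the form |r - s| <= eps, equal-width bins, and a greedy heaviest-first assignment of matrix cells to workers. The names `build_cost_matrix` and `assign_partitions` are hypothetical.

```python
from collections import Counter
import heapq


def build_cost_matrix(r_values, s_values, bin_width, eps):
    """Estimate comparison cost per (R-bin, S-bin) cell, skipping cells
    whose bins are too far apart to contain any pair within eps."""
    r_hist = Counter(int(v // bin_width) for v in r_values)
    s_hist = Counter(int(v // bin_width) for v in s_values)
    cost = {}
    for i, r_count in r_hist.items():
        for j, s_count in s_hist.items():
            # Lower bound on the distance between any value in bin i
            # and any value in bin j (0 for equal or adjacent bins).
            gap = max(0, abs(i - j) - 1) * bin_width
            if gap > eps:
                continue  # filtered: no pair in this cell can match
            # Assumed cost model: one comparison per candidate pair.
            cost[(i, j)] = r_count * s_count
    return cost


def assign_partitions(cost, num_workers):
    """Greedy heuristic: hand the next-heaviest cell to the worker
    with the smallest accumulated load so far."""
    heap = [(0, w) for w in range(num_workers)]
    heapq.heapify(heap)
    assignment = {w: [] for w in range(num_workers)}
    # Heaviest cells first; ties broken by cell index for determinism.
    for cell, c in sorted(cost.items(), key=lambda kv: (-kv[1], kv[0])):
        load, w = heapq.heappop(heap)
        assignment[w].append(cell)
        heapq.heappush(heap, (load + c, w))
    loads = {w: load for load, w in heap}
    return assignment, loads
```

Calling `build_cost_matrix` on the two inputs and then passing the result to `assign_partitions` yields a mapping from worker to matrix cells plus the estimated load per worker. The greedy step is only a stand-in for whatever partition generation the paper actually uses.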


