Using a Greedy Algorithm for the Improvement of a MapReduce, Theta join, M-Bucket-I Heuristic

Kim, Wooyeol; Shim, Kyuseok

doi:10.5626/JOK.2016.43.2.229

Journal of KIISE

Vol. 43, No. 2 / pp. 229-236 / 2016 / 2383-630X (pISSN) / 2383-6296 (eISSN)

Korean Institute of Information Scientists and Engineers (KIISE)


Kim, Wooyeol (Department of Electrical and Computer Engineering, Seoul National University);
Shim, Kyuseok (Department of Electrical and Computer Engineering, Seoul National University)

Received: 2015.07.27
Reviewed: 2015.11.08
Published: 2016.02.15


Abstract

Theta join is one of the most fundamental and important queries in database systems. As the volume of data to be processed grows, processing theta joins on a single machine becomes impractical, so theta-join algorithms built on distributed parallel processing frameworks such as MapReduce have been widely studied. A representative state-of-the-art algorithm uses the M-Bucket-I heuristic, which computes a reducer mapping, that is, an assignment of each record to one of $r_{max}$ reducers. However, the heuristic is hard to use in practice because computing this mapping takes O(n) time, where n is the size of the input data. In this paper, we propose the MBI-I algorithm, which produces the same reducer mapping as the M-Bucket-I heuristic while reducing the running time to $O(r_{max} \log n)$. Through a variety of experiments, we confirm that our algorithm improves the performance of the existing MapReduce-based theta join by about 10%.
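To make the notion of a reducer mapping concrete, the following is a minimal sketch under stated assumptions. It is not the M-Bucket-I heuristic or the MBI-I algorithm, whose details the abstract does not give. It uses a plain fixed grid over the join matrix of two inputs S and T, and simulates the map and reduce phases in a single process. The names `grid_reducer_mapping` and `theta_join`, and the `rows`/`cols` parameters, are hypothetical and chosen only for illustration.

```python
from itertools import product


def grid_reducer_mapping(n_s, n_t, rows, cols):
    """Assign every record index to the reducers that must receive it.

    Assumption: reducers are the cells of a rows x cols grid laid over the
    n_s x n_t join matrix. An S record is sent to every cell in its row
    band, a T record to every cell in its column band.
    """
    s_map = {i: [(i * rows // n_s) * cols + c for c in range(cols)]
             for i in range(n_s)}
    t_map = {j: [r * cols + (j * cols // n_t) for r in range(rows)]
             for j in range(n_t)}
    return s_map, t_map


def theta_join(S, T, theta, rows=2, cols=2):
    """Simulate a MapReduce-style theta join with the grid mapping above."""
    s_map, t_map = grid_reducer_mapping(len(S), len(T), rows, cols)

    # "Map" phase: ship each record to its assigned reducers.
    buckets = {r: ([], []) for r in range(rows * cols)}
    for i, s in enumerate(S):
        for r in s_map[i]:
            buckets[r][0].append(s)
    for j, t in enumerate(T):
        for r in t_map[j]:
            buckets[r][1].append(t)

    # "Reduce" phase: each reducer checks the join condition locally.
    result = []
    for s_part, t_part in buckets.values():
        result.extend((s, t) for s, t in product(s_part, t_part) if theta(s, t))
    return result


if __name__ == "__main__":
    S = [1, 5, 3, 8]
    T = [2, 4, 6]
    print(sorted(theta_join(S, T, lambda s, t: s < t)))
```

Because a pair (s, t) meets only in the single cell where its row band and column band intersect, every pair is tested exactly once. A heuristic such as M-Bucket-I aims to choose the regions more carefully than this fixed grid does; how it does so is outside what this sketch shows.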


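Project Information

Supervising institutions: National Research Foundation of Korea (NRF), Institute for Information & Communications Technology Promotion (IITP)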

