Efficient Computation of Data Cubes Using MapReduce

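  • 이기용 (Division of Computer Science, Sookmyung Women's University) ;
  • 박소정 (Division of Computer Science, Sookmyung Women's University) ;
  • 박은주 (Division of Computer Science, Sookmyung Women's University) ;
  • 박진경 (Division of Computer Science, Sookmyung Women's University) ;
  • 최연정 (Division of Computer Science, Sookmyung Women's University)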
  • Received : 2014.07.29
  • Accepted : 2014.09.04
  • Published : 2014.11.30

Abstract

MapReduce is a programming model for processing large amounts of data in parallel across many machines. The data cube is an operator widely used to analyze large data sets: it computes group-bys for all possible combinations of a given set of dimension attributes. When there are $n$ dimension attributes, the data cube computes $2^n$ group-bys. In this paper, we propose an efficient method for computing data cubes using MapReduce. The proposed method partitions the $2^n$ group-bys into ${}_nC_{\lceil n/2 \rceil}$ batches and computes those batches in stages using $\lceil n/2 \rceil$ MapReduce jobs. Compared with existing methods, the proposed method significantly reduces the amount of intermediate data generated by mappers, which in turn reduces the cost of sorting and transferring that data. Consequently, the total processing time for computing a data cube is reduced. Experiments show that the proposed method computes data cubes faster than existing methods.
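To make the counts in the abstract concrete, the following is a minimal single-machine sketch in Python. It enumerates the $2^n$ group-bys of a data cube, computes each one naively with a full scan, and evaluates the batch and job counts stated above. It is not the paper's MapReduce algorithm: the abstract does not describe how group-bys are assigned to batches, so no batching is attempted here. The sample rows, the attribute names (`region`, `product`, `year`, `sales`), and the helper names `all_group_bys`, `naive_cube`, and `plan_size` are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations
from math import ceil, comb


def all_group_bys(dims):
    """Return every subset of the dimension attributes (2^n group-bys)."""
    return [combo for k in range(len(dims) + 1) for combo in combinations(dims, k)]


def naive_cube(rows, dims, measure):
    """Compute SUM(measure) for every group-by, scanning the rows once per group-by.

    This is the naive baseline, not the batched multi-job method of the paper.
    """
    cube = {}
    for group_by in all_group_bys(dims):
        totals = defaultdict(int)
        for row in rows:
            key = tuple(row[d] for d in group_by)
            totals[key] += row[measure]
        cube[group_by] = dict(totals)
    return cube


def plan_size(n):
    """Counts stated in the abstract: (group-bys, batches, MapReduce jobs)."""
    half = ceil(n / 2)
    return 2 ** n, comb(n, half), half


# Hypothetical sample data with n = 3 dimension attributes.
rows = [
    {"region": "east", "product": "pen", "year": 2013, "sales": 5},
    {"region": "east", "product": "ink", "year": 2014, "sales": 7},
    {"region": "west", "product": "pen", "year": 2014, "sales": 3},
]
dims = ["region", "product", "year"]
cube = naive_cube(rows, dims, "sales")
```

With three dimension attributes, `cube` holds 8 group-bys, and `plan_size(3)` evaluates to 8 group-bys, 3 batches, and 2 jobs. The naive version rescans the input once per group-by, which is the kind of repeated work that a batched approach aims to avoid.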
