Rhipe Platform for Big Data Processing and Analysis

Jung, Byung Ho;Shin, Ji Eun;Lim, Dong Hoon;

doi:10.5351/KJAS.2014.27.7.1171

The Korean Journal of Applied Statistics (응용통계연구)

Volume 27, Issue 7, Pages 1171-1185, 2014, pISSN 1225-066X, eISSN 2383-5818

The Korean Statistical Society (한국통계학회)

Jung, Byung Ho (Department of Information Statistics, Gyeongsang National University) ;
Shin, Ji Eun (Department of Information Statistics, Gyeongsang National University) ;
Lim, Dong Hoon (Department of Information Statistics, Gyeongsang National University)

Received : 2014.09.30
Accepted : 2014.12.23
Published : 2014.12.31

Abstract

Rhipe, which integrates R with the Hadoop environment, makes it possible to process and analyze massive amounts of data in a distributed processing environment. In this paper, we implemented multiple regression analysis with Rhipe on real and simulated data sets of various sizes. In experiments comparing the computing speeds of Hadoop clusters configured in pseudo-distributed and fully-distributed modes, the fully-distributed mode was faster than the pseudo-distributed mode, and its computing time decreased as the number of data nodes increased. We also compared the performance of Rhipe with that of the stats package in base R and the biglm package used with bigmemory. The results showed that Rhipe was faster than the other packages, because the number of map tasks grows with the size of the data and these tasks are processed in parallel.
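The abstract does not say how the regression was split into map and reduce steps. One common way to fit a multiple regression in a MapReduce style is for each map task to compute the partial sums X'X and X'y for its block of rows, and for the reduce step to add these partial sums and solve the normal equations once. The sketch below illustrates that idea in plain Python under this assumption. It does not use Rhipe or Hadoop, and the function names (map_chunk, combine, solve, fit) and the toy data are hypothetical, chosen only for illustration.

```python
from functools import reduce


def map_chunk(rows):
    """Map step: return the partial sums (X'X, X'y) for one block of rows.

    Each row is a pair (features, y); an intercept column of 1s is added.
    """
    p = len(rows[0][0]) + 1  # number of coefficients, including the intercept
    xtx = [[0.0] * p for _ in range(p)]
    xty = [0.0] * p
    for features, y in rows:
        x = [1.0] + list(features)
        for i in range(p):
            xty[i] += x[i] * y
            for j in range(p):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty


def combine(a, b):
    """Reduce step: element-wise sum of two partial (X'X, X'y) pairs."""
    xtx = [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    xty = [u + v for u, v in zip(a[1], b[1])]
    return xtx, xty


def solve(a, b):
    """Solve a * beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    beta = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * beta[j] for j in range(i + 1, n))
        beta[i] = (m[i][n] - s) / m[i][i]
    return beta


def fit(chunks):
    """Run the map step on every block, reduce by summation, then solve."""
    xtx, xty = reduce(combine, map(map_chunk, chunks))
    return solve(xtx, xty)


# Toy data generated exactly from y = 1 + 2*x1 + 3*x2 (illustrative only).
data = [((x1, x2), 1 + 2 * x1 + 3 * x2) for x1 in range(4) for x2 in range(3)]
chunks = [data[:5], data[5:9], data[9:]]  # three blocks, as if three map tasks
beta = fit(chunks)
```

Because the partial sums are additive, the fitted coefficients do not depend on how the rows are divided into blocks, so in this scheme adding map tasks changes only how the work is shared out and not the estimate.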
