Distributed Processing of Big Data Analysis based on R using SparkR

  • Ryu, Woo-Seok / 류우석 (Dept. of Health Care Management, Catholic University of Pusan)
  • Received : 2021.11.29
  • Accepted : 2022.02.17
  • Published : 2022.02.28

Abstract

This paper analyzes the problems that arise when R is used as the analysis tool for big data, and demonstrates the usefulness of SparkR, which connects R to Spark so that big data can be processed in a distributed manner. We first examine the memory allocation problem that R encounters when loading and operating on large amounts of data, together with the characteristics and programming environment of SparkR. We then compare the execution performance of linear regression analysis in each environment. The results show that, through SparkR, R can be used for data analysis without learning an additional language, and that code written in R is distributed effectively as the number of nodes in the cluster increases.
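
As a rough illustration of why linear regression lends itself to distributed processing, the sketch below fits a one-predictor least-squares line from per-partition summaries that are simply added together. It is not taken from the paper and is not how SparkR or Spark implements regression; the names `Stats`, `partition_stats`, and `fit`, and the sample data, are all hypothetical, and plain Python lists stand in for cluster nodes.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass
class Stats:
    """Additive sufficient statistics for the line y = intercept + slope * x."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0

    def merge(self, other):
        # Summaries from two partitions combine by simple addition.
        return Stats(self.n + other.n, self.sx + other.sx, self.sy + other.sy,
                     self.sxx + other.sxx, self.sxy + other.sxy)


def partition_stats(rows):
    """Summarise one partition; on a cluster each node would do this locally."""
    s = Stats()
    for x, y in rows:
        s = s.merge(Stats(1, x, y, x * x, x * y))
    return s


def fit(partitions):
    """Merge the per-partition summaries and solve the normal equations."""
    total = reduce(Stats.merge, map(partition_stats, partitions), Stats())
    denom = total.n * total.sxx - total.sx ** 2
    slope = (total.n * total.sxy - total.sx * total.sy) / denom
    intercept = (total.sy - slope * total.sx) / total.n
    return intercept, slope


# Hypothetical data lying exactly on y = 2 + 3x, split across three "nodes".
data = [(float(x), 2.0 + 3.0 * x) for x in range(12)]
partitions = [data[0:4], data[4:8], data[8:12]]
intercept, slope = fit(partitions)
```

Because only five numbers per partition have to be exchanged, no single process needs the whole data set in memory, which is the kind of limitation the abstract attributes to plain R.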

Acknowledgement

This research was supported by the 2019 intramural research fund of Catholic University of Pusan.
