Design of Extended Real-time Data Pipeline System Architecture

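  • 신호승 (Department of Computer Science, Korea Advanced Institute of Science and Technology) ;
  • 강성원 (Department of Computer Science, Korea Advanced Institute of Science and Technology) ;
  • 이지현 (Liberal Arts College, Daejeon University)
  • Received : 2015.03.12
  • Reviewed : 2015.06.01
  • Published : 2015.08.15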

Abstract

Because big data systems are widely used to collect large-scale log data, high performance is essential. Current Hadoop-based big data systems, however, suffer from an architectural problem: redundant processing lowers their performance. This paper addresses the problem by improving the architecture design. The proposed architecture collects data by combining the batch-based collection of the existing architecture with per-record (individual) processing, and it analyzes the data being collected directly in memory, which eliminates the redundant processing and yields higher performance. The proposed architecture also retains the system scalability that is a strength of the existing Hadoop-based architecture. In a test-bed environment, the proposed architecture analyzed and processed data 30% to 35% faster than the existing architecture and was confirmed to be scalable.
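The idea of mixing batch collection with per-record processing and analyzing data while it is still in memory can be illustrated with a small sketch. This is not the authors' implementation: the class name HybridCollector, the batch_size parameter, the sink callback (a stand-in for a storage writer such as an HDFS client), and the log line format are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Callable, List


class HybridCollector:
    """Toy collector (hypothetical): per-record in-memory analysis plus batched persistence."""

    def __init__(self, batch_size: int, sink: Callable[[List[str]], None]) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self.batch_size = batch_size
        self.sink = sink                          # assumed stand-in for a storage writer
        self.buffer: List[str] = []
        self.status_counts: Counter = Counter()   # analysis result kept in memory

    def collect(self, log_line: str) -> None:
        parts = log_line.split()
        if not parts:
            return  # ignore blank lines
        # Per-record path: analyze the record now, while it is in memory,
        # so it never has to be read back from storage for analysis.
        self.status_counts[parts[-1]] += 1
        # Batch path: buffer the raw record and persist it in groups.
        self.buffer.append(log_line)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Hand any buffered records to the sink as one batch."""
        if self.buffer:
            self.sink(self.buffer)
            self.buffer = []


stored_batches: List[List[str]] = []
collector = HybridCollector(batch_size=2, sink=stored_batches.append)
for line in ["GET /a 200", "GET /b 404", "GET /c 200"]:
    collector.collect(line)
collector.flush()  # persist the final partial batch
```

In this sketch each record is touched once for analysis and once for buffering; a store-then-analyze design would instead write the batch and later read it again, which is the kind of redundant processing the abstract describes.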

Keywords

Project Information

Supervising research agency : National Research Foundation of Korea
