Performance Evaluation of the GPU Architecture Executing Parallel Applications

Performance Analysis by GPU Architecture When Executing Parallel Applications

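  • Hong Jun Choi (Department of Electronics and Computer Engineering, Chonnam National University)
  • Cheol Hong Kim (Department of Electronics and Computer Engineering, Chonnam National University)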
  • Received : 2012.02.14
  • Accepted : 2012.04.20
  • Published : 2012.05.28

Abstract

The role of the GPU has evolved from graphics-specific processing to general-purpose processing with the development of the unified shader core architecture. In particular, methods for executing general-purpose parallel applications on GPUs have been researched intensively, since parallel applications can use the parallel hardware efficiently. However, the current GPU architecture has limitations in executing general-purpose parallel applications, since the GPU is not yet specialized for general-purpose computing. To improve GPU performance on such applications, the GPU architecture must evolve. In this work, we analyze GPU performance while varying the number of cores and the clock frequency. Our simulation results show that GPU performance improves by up to 125.8% as the number of cores increases and by up to 16.2% as the clock frequency increases. However, the improvement saturates as the number of cores and the clock frequency continue to increase, because limited memory bandwidth prevents data from being supplied to the GPU fast enough. Consequently, to improve GPU performance efficiently, system resources such as memory bandwidth must be scaled along with computational capability.

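Since the development of the unified shader core architecture, the GPU has been evolving from a graphics-only processor into a general-purpose processor. In particular, because parallel applications can make effective use of parallelized hardware, techniques for running parallel applications on the GPU have been attracting attention. However, the current GPU architecture is limited in that it cannot secure sufficient parallelism when executing non-graphics applications, and the GPU architecture is changing rapidly to address this. In this paper, to examine the direction of GPU architecture development, we analyze in detail how GPU performance depends on the hardware configuration, such as the number of cores and the clock frequency, when non-graphics parallel applications are executed. The experimental results show that as the number of cores increases from 30 to 192 and the clock frequency increases from 325 MHz to 450 MHz, GPU performance improves by 28.9% to 125.8% and by 4.4% to 16.2%, respectively, whereas the efficiency of the performance improvement decreases. The main cause of this decrease is that memory cannot adequately serve the data demand, which grows with the increased computing capability. Consequently, the experiments show concretely that, to further raise the efficiency of GPU performance improvement, system resources must also be adapted to the GPU architecture alongside improvements in computing capability.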
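
The saturation described in the abstract can be illustrated with a simple roofline-style bound, in which attainable throughput is the smaller of the compute peak and the rate at which memory can feed the cores. The sketch below is not the simulation methodology of the paper. The function name, the FLOPs per core per cycle, the memory bandwidth, and the arithmetic intensity are all hypothetical values chosen for illustration; only the core counts (30 to 192) and clock frequencies (325 MHz to 450 MHz) are taken from the abstract.

```python
def attainable_gflops(cores, clock_mhz,
                      flops_per_core_per_cycle=2.0,  # assumption, not from the paper
                      mem_bandwidth_gbs=100.0,       # assumption, not from the paper
                      flops_per_byte=0.5):           # assumed arithmetic intensity
    """Roofline-style bound: min(compute peak, bandwidth * arithmetic intensity)."""
    # Compute peak in GFLOP/s: cores * cycles per second * FLOPs per cycle.
    compute_peak = cores * clock_mhz * 1e6 * flops_per_core_per_cycle / 1e9
    # Throughput the memory system can sustain for this workload, in GFLOP/s.
    memory_bound = mem_bandwidth_gbs * flops_per_byte
    return min(compute_peak, memory_bound)


if __name__ == "__main__":
    baseline = attainable_gflops(30, 325)
    # Sweep the core count at a fixed clock, then the clock at a fixed core count.
    for cores in (30, 60, 120, 192):
        gain = attainable_gflops(cores, 325) / baseline
        print(f"{cores:3d} cores @ 325 MHz: {gain:.2f}x over baseline")
    for clock in (325, 375, 450):
        gain = attainable_gflops(192, clock) / baseline
        print(f"192 cores @ {clock} MHz: {gain:.2f}x over baseline")
```

Under these assumed parameters, the bound stops growing once the compute peak exceeds the memory-limited rate, so adding cores or raising the clock beyond that point yields no further gain. This is the qualitative behavior the abstract attributes to limited memory bandwidth; the actual crossover point depends on the workload and the memory system.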
