Analysis on the Active/Inactive Status of Computational Resources for Improving the Performance of the GPU

Analysis of the Active/Inactive Status of Internal Resources to Resolve GPU Performance Degradation

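  • Hong-Jun Choi (최홍준) (The Attached Institute of ETRI) ;
  • Dong-Oh Son (손동오) (School of Electronics and Computer Engineering, Chonnam National University) ;
  • Jong-Myon Kim (김종면) (School of Electrical Engineering, University of Ulsan) ;
  • Cheol-Hong Kim (김철홍) (School of Electronics and Computer Engineering, Chonnam National University)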
  • Received : 2015.01.19
  • Accepted : 2015.05.27
  • Published : 2015.07.28

Abstract

In recent high-performance computing systems, GPGPU has been widely used to run general-purpose applications as well as graphics applications, because the GPU provides computational resources optimized for massively parallel processing. However, GPGPU does not fully exploit the GPU's computational resources when executing general-purpose applications, because the characteristics of those applications are not optimized for the GPU architecture. This paper therefore offers guidelines for GPU research aimed at improving the performance of computing systems that use GPGPU. To this end, we analyze the factors that degrade GPU performance. To classify the causes of degradation clearly, we define five GPU core statuses: fully active, partially active, idle, memory stall, and GPU core stall. Every status other than fully active degrades performance. We measure the ratio of each GPU core status across benchmarks with different characteristics to identify the specific causes of degradation. According to our simulation results, the partially active, idle, memory stall, and GPU core stall statuses are caused by underutilization of computational resources, low program parallelism, a high volume of memory requests, and structural hazards, respectively.
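
The five-way classification described in the abstract can be illustrated with a small sketch. The abstract does not state the exact criteria used to assign a cycle to a status, so the rules below are assumptions chosen only to match the stated causes (resource underutilization, low parallelism, memory requests, structural hazards). The names `CycleSample`, `classify`, and `status_ratios`, the per-cycle fields, and the warp size of 32 are all hypothetical and are not taken from the paper or from any simulator.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class CoreStatus(Enum):
    FULLY_ACTIVE = "fully active"
    PARTIALLY_ACTIVE = "partially active"
    IDLE = "idle"
    MEMORY_STALL = "memory stall"
    CORE_STALL = "GPU core stall"


@dataclass
class CycleSample:
    """Hypothetical snapshot of one GPU core in one cycle (not from the paper)."""
    resident_warps: int           # warps currently assigned to the core
    warps_waiting_on_memory: int  # resident warps blocked on a memory request
    issued: bool                  # True if a warp instruction issued this cycle
    active_lanes: int = 0         # lanes enabled in the issued instruction


WARP_SIZE = 32  # assumption: one lane per thread of a 32-thread warp


def classify(sample: CycleSample, warp_size: int = WARP_SIZE) -> CoreStatus:
    """Map one cycle to a status using assumed, illustrative rules."""
    if sample.issued:
        # All lanes busy: no waste. Fewer lanes busy: resources underutilized.
        if sample.active_lanes == warp_size:
            return CoreStatus.FULLY_ACTIVE
        return CoreStatus.PARTIALLY_ACTIVE
    if sample.resident_warps == 0:
        # Nothing to run: too little parallelism to occupy the core.
        return CoreStatus.IDLE
    if sample.warps_waiting_on_memory == sample.resident_warps:
        # Every resident warp is blocked on memory.
        return CoreStatus.MEMORY_STALL
    # Some warp is ready but could not issue: treated here as a structural hazard.
    return CoreStatus.CORE_STALL


def status_ratios(samples):
    """Return the fraction of cycles spent in each status."""
    counts = Counter(classify(s) for s in samples)
    total = sum(counts.values())
    return {status: (counts[status] / total if total else 0.0)
            for status in CoreStatus}


if __name__ == "__main__":
    trace = [
        CycleSample(resident_warps=4, warps_waiting_on_memory=0, issued=True, active_lanes=32),
        CycleSample(resident_warps=4, warps_waiting_on_memory=0, issued=True, active_lanes=8),
        CycleSample(resident_warps=0, warps_waiting_on_memory=0, issued=False),
        CycleSample(resident_warps=4, warps_waiting_on_memory=4, issued=False),
        CycleSample(resident_warps=4, warps_waiting_on_memory=1, issued=False),
    ]
    for status, ratio in status_ratios(trace).items():
        print(f"{status.value}: {ratio:.2f}")
```

Under these assumed rules, the ratio of each status over a trace plays the role of the per-benchmark breakdown the abstract describes: a large partially active share would point to underutilized lanes, and a large memory stall share to heavy memory traffic.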
