DOI QR코드

DOI QR Code

Analysis of Impact of Correlation Between Hardware Configuration and Branch Handling Methods Executing General Purpose Applications

범용 응용프로그램 실행 시 하드웨어 구성과 분기 처리 기법에 따른 GPU 성능 분석

  • 최홍준 (전남대학교 전자컴퓨터공학부) ;
  • 김철홍 (전남대학교 전자컴퓨터공학부)
  • Received : 2013.01.07
  • Accepted : 2013.03.07
  • Published : 2013.03.28

Abstract

Due to increased computing power and flexibility of GPU, recent GPUs execute general purpose parallel applications as well as graphics applications. Programmers can use GPGPU by using the APIs from GPU vendors. Unfortunately, computational resources of GPU are not fully utilized when executing general purpose applications because of frequent branch instructions. To handle the branch problem, several warp formations have been proposed. Intuitively, we expect that the warp formations providing higher computational resource utilization show higher performance. Contrary to our expectations, according to simulation results, the performance of the warp formation providing better utilization is lower than that of the warp formation providing worse utilization. This is because warp formation providing high utilization causes serious memory bottleneck due to increased memory request. Therefore, warp formation providing high computation utilization cannot guarantee high performance without proper hardware resources. For this reason, we will analyze the correlation between hardware configuration and warp formation. Our simulation results present the guideline to solve the underutilization problem due to branch instructions when designing recent GPU.

GPU의 연산 능력과 유연성이 강화됨에 따라, GPU는 그래픽 응용프로그램뿐만 아니라 범용 응용프로그램도 수행한다. 특히, GPU 회사들이 제공하는 API를 활용함으로써 프로그래머들은 보다 쉽게 GPGPU 응용프로그램을 작성할 수 있다. 하지만 대부분의 범용 응용프로그램은 분기 명령어를 많이 포함하고 있기 때문에, 범용 응용프로그램을 수행하는 경우 GPU의 연산 자원을 충분히 활용할 수 없다. 분기 명령어를 처리하기 위해서 다양한 워프 생성 기법들이 제안되었다. GPU 구조에서는 높은 연산 자원 활용률을 보이는 워프 생성기법이 우수한 성능을 보일 것으로 예상된다. 하지만 예상과는 달리, 실험 결과에 따르면 높은 연산 자원 활용률을 보이는 워프 생성 기법의 성능이 상대적으로 낮은 연산 자원 활용률을 보이는 워프 생성 기법의 성능보다 낮게 나타난다. 높은 연산 자원 활용률을 보이는 워프 생성 기법에서 유발한 많은 메모리 요구로 인한 심각한 메모리 병목 현상이 원인으로 분석된다. 그러므로 적절한 하드웨어 지원이 없는 경우, 높은 연산자원 활용률이 반드시 우수한 성능을 보장한다고 할 수 없다. 이러한 이유로, 본 논문에서는 하드웨어 자원과 워프 생성 기법사이의 상관관계에 대한 상세한 분석을 수행하고자 한다. 본 논문의 분석 결과는 분기 명령어에 의해 발생된 GPU의 성능 저하 문제를 해결하고자 할 때 중요한 가이드라인이 될 것이다.

Keywords

References

  1. V. Agarwal, M. S. Hrishikesh, S. W. Keckler, and D. Burger, "Clock rate versus IPC: the end of the road for conventional microArchitectures," In Proceedings of 27th International Symposium on Computer Architecture, pp.248-259, 2000.
  2. N. P. Jouppi and D. W. Wall, "Available instruction-level parallelism for superscalar and superpipelined machines," In Proceedings of 3th International Conference on Architectural Support for Programming Languages and Operating Systems, pp.272-282, 1989.
  3. D. M. Tullsen, S. J. Eggers, and H. M. Levy, "Simultaneous multithreading: maximizing on-chip parallelism," In Proceedings of 22th International Symposium on Computer Architecture, pp.392-403, 1995.
  4. Y. H. Jang, C. Park, J. H. Park, N. Kim, and K. H. Yoo, "Parallel Processing for Integral Imaging Pickup using Multiple Threads," International Journal of Korea Contents, Vol.5, No.4, pp.30-34, 2009. https://doi.org/10.5392/IJoC.2009.5.4.030
  5. I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan, "Brook for GPUs: stream computing on graphics hardware," In Proceedings of 31th Annual Conference on Computer Graphics (SIGGRAPH), pp.777-786, 2004.
  6. E. Lindholm, M. J. Kligard, and H. P. Moreton, "A user-programmable vertex engine," In Proceedings of 28th Annual Conference on Computer Graphics (SIGGRAPH), pp.149-158, 2001.
  7. J. D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Kruger, A. E. Lefohn, and T. J. Purcell, "A Survey of General-Purpose Computation on Graphics Hardware," Eurographics 2005, State of the Art Reports, pp.21-51, 2005.
  8. http://developer.nvidia.com/object/cuda_3_1_do wnloads.html
  9. http://www.khronos.org/opencl/
  10. J. Helin, "Performance analysis of the CM-2, a massively parallel SIMD computer," In Proceedings of 6th International Conference on Supercomputing, pp.45-52, 1992.
  11. A. Levinthal and T. Porter, "Chap-a SIMD graphics processor," In Proceedings of 11th Annual Conference on Computer Graphics (SIGGRAPH), pp.77-82, 1984.
  12. S. Che, J. Meng, J. Sheaffer, and K. Skadron, "A performance study of general purpose applications on graphics processors using CUDA," Journal of Parallel and Distributed Computing, Vol.68, No.10, pp.1370-1380, 2008. https://doi.org/10.1016/j.jpdc.2008.05.014
  13. R. A. Lorie and H. R. Strong, "Method for conditional branch execution in SIMD vector processors," US Patent 4435758, Vol.6, 1984(3).
  14. S. Moy and E. Lindholm, "Method and system for programmable pipelined graphics processing with branching instructions," US Patent 6947047, Vol.20, 2005(9).
  15. E. Rotenberg, Q. Jacobson, and J. E. Smith, "A study of control independence in superscalar processors," In Proceedings of 5th International Symposium on High-Performance Computer Architecture, pp.115-124, 1999.
  16. B. W. Coon and J. E. Lindholm, "System and method for managing divergent threads in a SIMD architecture," US Patent 7353369, Vol.1, 2008(4).
  17. E. Rotenberg, Q. Jacobson, and J. Smith, "A study of control independence in superscalar processors," In Proceedings of 5th International Symposium on High-Performance Computer Architecture, pp.115-124, 1999.
  18. W. W. L. Fung, I. Sham, G. Yuan, and T. M. Aamodt, "Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow," In Proceedings of 40th Microarchitecture, pp.407-420, 2007.
  19. H. J. Choi and C. H. Kim, "Performance Evaluation of the GPU Architecture Executing Parallel Applications," Journal of the Korea Contents Association, Vol.12, No.5, pp.10-21, 2012. https://doi.org/10.5392/JKCA.2012.12.05.010
  20. H. J. Choi, H. G. Jeon, and C. H. Kim, "Quantitative Anaysis of the Negative Factors on the GPU Performance," Journal of KIISE : Computing Practices and Letters, Vol.18, No.4, pp.282-287, 2012.
  21. H. J. Choi, S. G. Kang, J. M. Kim, and C. H. Kim, "Analysis of the CPU/GPU Temperature and Energy Efficiency depending on Executed Applications," Journal of The Korea Society of Computer and Information, Vol.17, No.5, pp.9-20, 2012. https://doi.org/10.9708/jksci.2012.17.5.009
  22. http://www.amd.com/stream
  23. https://developer.nvidia.com/cg-toolkit
  24. http://msdn2.microsoft.com/en-us/library/bb50 9638.aspx
  25. http://www.opengl.org/registry/doc/GLSLangS pec.Full.1.20.8.pdf
  26. http://www.simplescalar.com
  27. A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA Workloads Using a Detailed GPU Simulator," In Proceedings of 9th International Symposium on Performance Analysis of Systems and Software, pp.163-174, 2009.
  28. http://www.nvidia.com/object/product_quadro_fx_5800_us.html
  29. http://nocs.stanford.edu/booksim.html
  30. http://developer.download.nvidia.com/compute/ cuda/sdk/website/samples.html
  31. http://www.nvidia.com/content/cudazone/
  32. M. J. Flynn, "Very high-speed computing systems," Proceedings of the IEEE, Vol.54, No.12, pp. 1901-1909, 1966. https://doi.org/10.1109/PROC.1966.5273

Cited by

  1. Analysis on the GPU Performance according to Hierarchical Memory Organization vol.14, pp.3, 2014, https://doi.org/10.5392/JKCA.2014.14.03.022