Analysis on the GPU Performance according to Hierarchical Memory Organization


  • Hong-Jun Choi (Dept. of Electronics and Computer Engineering, Chonnam National University) ;
  • Jong-Myon Kim (School of Electrical Engineering, University of Ulsan) ;
  • Cheol-Hong Kim (Dept. of Electronics and Computer Engineering, Chonnam National University)
  • Received : 2013.11.14
  • Accepted : 2013.12.26
  • Published : 2014.03.28

Abstract

Recently, GPGPU computing, which exploits the GPU's hardware optimized for parallel processing, has been widely used for general-purpose workloads as well as graphics. The memory system has a large impact on the performance of massively parallel processors such as the GPU. To improve memory system efficiency, the GPU adopts a hierarchical memory architecture, which reduces memory bandwidth usage, together with memory address coalescing and memory request merging, which reduce the number of memory transactions. To quantify how these techniques affect GPU performance, this paper analyzes GPU performance across various memory organizations. According to our simulation results, adding an 8KB, 16KB, 32KB, or 64KB L1 cache improves GPU performance by 15.5%, 21.5%, 25.5%, and 30.9% on average, respectively, compared to the configuration without an L1 cache. However, the results also show that some benchmarks lose performance because the number of memory transactions increases due to data dependency. Moreover, when cache misses occur frequently, the average memory access latency increases as the depth of the cache hierarchy grows.
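To make the memory address coalescing idea mentioned above concrete, the sketch below contrasts a coalesced access pattern with a strided one in a pair of minimal CUDA kernels. The kernel names, array size, and stride are illustrative assumptions, not the configuration used in the paper's experiments: when the threads of a warp touch consecutive addresses, the hardware can combine their loads into a few wide memory transactions, whereas a strided pattern forces many separate transactions.

```cuda
// Minimal, self-contained sketch (not from the paper) of why memory address
// coalescing reduces the number of memory transactions a warp generates.
#include <cstdio>
#include <cuda_runtime.h>

// Consecutive threads read consecutive 4-byte words: a warp's accesses fall
// into a few contiguous segments and coalesce into few transactions.
__global__ void copyCoalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Consecutive threads read words 'stride' elements apart: each thread of the
// warp touches a different memory segment, so one warp-level load turns into
// many separate memory transactions.
__global__ void copyStrided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[(long long)i * stride % n];
}

int main()
{
    const int n = 1 << 20;                    // 1M floats (illustrative size)
    float *in, *out;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    dim3 block(256), grid((n + block.x - 1) / block.x);
    copyCoalesced<<<grid, block>>>(in, out, n);
    copyStrided<<<grid, block>>>(in, out, n, 32);   // 32-element stride
    cudaDeviceSynchronize();

    printf("kernels finished: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

GPU profilers report per-kernel global memory transaction counts, which is exactly the quantity that the coalescing and request-merging hardware tries to minimize.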

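The last observation in the abstract, that average memory access latency can grow with the depth of the cache hierarchy when misses dominate, follows from the standard average memory access time (AMAT) decomposition. The formulas below are the textbook model, and the latencies and miss rates are illustrative assumptions, not measurements from the paper.

```latex
% Textbook AMAT model for one- vs. two-level hierarchies; all numbers used in
% the example are illustrative assumptions, not results reported in the paper.
\begin{aligned}
\mathrm{AMAT}_{\text{L1 only}} &= t_{L1} + m_{L1}\, t_{\mathrm{mem}},\\
\mathrm{AMAT}_{\text{L1+L2}}   &= t_{L1} + m_{L1}\,\bigl(t_{L2} + m_{L2}\, t_{\mathrm{mem}}\bigr),\\
\mathrm{AMAT}_{\text{L1+L2}} < \mathrm{AMAT}_{\text{L1 only}}
  &\iff m_{L2} < 1 - \frac{t_{L2}}{t_{\mathrm{mem}}}.
\end{aligned}
```

For example, with $t_{L1}=1$, $t_{L2}=20$, and $t_{\mathrm{mem}}=400$ cycles, the second level pays off only while its local miss rate stays below $0.95$; with $m_{L1}=0.9$ and $m_{L2}=0.98$, the two-level hierarchy yields $1 + 0.9\,(20 + 392) = 371.8$ cycles versus $361$ cycles with the L1 alone, matching the behaviour described above for miss-heavy benchmarks.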

