
Memory Delay Comparison between 2D GPU and 3D GPU

  • Jeon, Hyung-Gyu (School of Electronics and Computer Engineering, Chonnam National University) ;
  • Ahn, Jin-Woo (School of Electronics and Computer Engineering, Chonnam National University) ;
  • Kim, Jong-Myon (School of Computer Engineering and Information Technology, University of Ulsan) ;
  • Kim, Cheol-Hong (School of Electronics and Computer Engineering, Chonnam National University)
  • Received : 2011.08.25
  • Accepted : 2012.04.20
  • Published : 2012.07.31

Abstract

As process technology scales down, the number of cores integrated into a single processor has increased dramatically, leading to significant performance improvements. In particular, the GPU (Graphics Processing Unit), which contains a large number of cores, achieves high computational performance by exploiting massive parallelism. In the GPU architecture, however, the access latency to the main memory has become one of the major factors restricting further performance improvement. In this work, we quantitatively analyze the memory access efficiency of the 3D GPU architecture compared to the 2D GPU architecture and investigate the potential problems of adopting the 3D architecture. On average, memory instructions account for 30% of total instructions, and global/local memory instructions, which involve main memory accesses, constitute 60% of all memory instructions. A 3D-stacked architecture that greatly reduces main memory access latency is therefore expected to improve GPU performance substantially. According to our experimental results, however, the 3D architecture improves GPU performance by only 2% over the 2D architecture because of the memory bottleneck: the performance loss caused by the memory bottleneck increases by up to 245% in the 3D GPU architecture compared to the 2D architecture. By analyzing both the efficiency and the problems of memory access in the 3D GPU architecture, this paper provides guidelines for designing a memory architecture suitable for 3D GPUs.
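To make the reasoning above concrete, the following sketch works through the Amdahl's-law-style estimate implied by the quoted instruction mix: if memory instructions are about 30% of all instructions and global/local (main memory) accesses are about 60% of those, then roughly 18% of instructions reach main memory. The 30% and 60% figures come from the abstract; the per-access cycle weights and the 2x latency reduction assumed for 3D stacking are hypothetical placeholders used only for illustration, not values reported in the paper.

# Back-of-envelope estimate of the ideal speedup from reducing main memory
# access latency in a GPU, based on the instruction mix quoted in the abstract.
# The cycle weights and the 3D-stacking latency-reduction factor are
# illustrative assumptions, not measurements from the paper.

MEM_INSN_RATIO = 0.30      # memory instructions / total instructions (from the abstract)
OFFCHIP_MEM_RATIO = 0.60   # global/local share of memory instructions (from the abstract)

def amdahl_speedup(time_fraction: float, factor: float) -> float:
    """Overall speedup when only `time_fraction` of execution time is sped up by `factor`."""
    return 1.0 / ((1.0 - time_fraction) + time_fraction / factor)

if __name__ == "__main__":
    # Fraction of instructions that access main (global/local) memory.
    offchip_insn_fraction = MEM_INSN_RATIO * OFFCHIP_MEM_RATIO   # = 0.18

    # Hypothetical assumption: a main memory access takes ~10x as long as an
    # average non-memory instruction, so its share of execution time is larger.
    cycles_other, cycles_offchip = 1.0, 10.0
    total_time = (1.0 - offchip_insn_fraction) * cycles_other + offchip_insn_fraction * cycles_offchip
    offchip_time_fraction = offchip_insn_fraction * cycles_offchip / total_time

    # Hypothetical assumption: TSV-based 3D stacking halves main memory latency.
    print(f"main memory share of execution time: {offchip_time_fraction:.2f}")
    print(f"ideal speedup with 2x faster main memory: {amdahl_speedup(offchip_time_fraction, 2.0):.2f}x")

Under these assumptions the idealized estimate is roughly a 1.5x speedup, far above the measured 2% gain, precisely because it ignores the memory bandwidth bottleneck that the experiments identify as the limiting factor in the 3D GPU.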

Keywords

References

  1. G. Cena, M. Cereia, S. Scanzio, A. Valenzano, C. Zunino, "A high-performance CUDA-based computing platform for industrial control systems," In Proceedings of the 2011 IEEE International Symposium on Industrial Electronics, pp.1169-1174, Gdansk, Poland, Jun. 2011.
  2. N. Maruyama, A. Nukada, S. Matsuoka, "A High-Performance Fault-Tolerant Software Framework for Memory on Commodity GPUs," In Proceedings of the 24th IEEE International Symposium on Parallel & Distributed Processing, pp.1-12, Atlanta, USA, Apr. 2010.
  3. J. Zhao, X. Dong, Y. Xie, "An Energy-Efficient 3D CMP Design with Fine-Grained Voltage Scaling," In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition, pp.1-4, Grenoble, France, Mar. 2011.
  4. D. H. Kim, K. Athikulwongse, S. K. Lim, "A Study of Through-Silicon-Via Impact on the 3D Stacked IC Layout," In Proceedings of the 2009 International Conference on Computer-Aided Design, pp.674-680, California, USA, Nov. 2009.
  5. J. W. Joyner, P. Zarkesh-Ha, J. D. Meindl, "A Stochastic Global Net-length Distribution for a Three-Dimensional System on Chip (3D-SoC)," In Proceedings of the 14th IEEE International ASIC/SOC Conference, pp.147-151, Arlington, USA, Sep. 2001.
  6. K. Puttaswamy and G. H. Loh, "Thermal Analysis of a 3D Die Stacked High Performance Microprocessor," In Proceedings of the ACM Great Lakes Symposium on VLSI, pp.19-24, Philadelphia, USA, May 2006.
  7. J. Kim, C. Nicopoulos, D. Park, R. Das, Y. Xie, V. Narayanan, M. Yousif, and C. Das, "A Novel Dimensionally-Decomposed Router for On-Chip Communication in 3D Architectures," In Proceedings of the International Symposium on Computer Architecture, pp.138-149, San Diego, USA, Jun. 2007.
  8. F. Li, C. Nicopoulos, T. Richardson, Y. Xie, V. Narayanan, and M. Kandemir, "Design and Management of 3D Chip Multiprocessors Using Network-in-Memory," In Proceedings of the International Symposium on Computer Architecture, pp.130-141, Boston, USA, May 2006.
  9. J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, J. C. Phillips, "GPU Computing," Proceedings of the IEEE, Vol. 96, No. 5, pp.879-899, May 2008. https://doi.org/10.1109/JPROC.2008.917757
  10. M. R. Thistle, B. J. Smith, "A processor architecture for Horizon," In Proceedings of SuperComputing, Vol. 1, Florida, USA, Nov. 1988.
  11. J. Meng, D. Tarjan, K. Skadron, "Dynamic Warp Subdivision for Integrated Branch and Memory Divergence Tolerance," In Proceedings of the 37th Annual International Symposium on Computer Architecture, pp.235-246, Saint-Malo, France, Jun. 2010.
  12. J. Lee, N. B. Lakshminarayana, H. Kim, R. Vuduc, "Many-Thread Aware Prefetching Mechanisms for GPGPU Applications," In Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture, pp.213-224, Georgia, USA, Dec. 2010.
  13. E. Z. Zhang, Y. Jiang, Z. Guo, K. Tian, X. Shen, "On-the-Fly Elimination of Dynamic Irregularities for GPU Computing," In Proceedings of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems, pp.369-380, California, USA, Mar. 2011.
  14. D. Burger, T. M. Austin, S. Bennett, "Evaluating future microprocessors: the SimpleScalar tool set," Technical Report TR-1308, University of Wisconsin-Madison Computer Sciences Department, Jul. 1997.
  15. A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, T. M. Aamodt, "Analyzing CUDA Workloads Using a Detailed GPU Simulator," In Proceedings of IEEE International Symposium on Performance Analysis of Systems and Software, pp.163-174, Miami, USA, Apr. 2009.
  16. D. W. Chang, C. D. Jenkins, P. C. Garcia, S. Z. Gilani, P. Aguilera, A. Nagarajan, M. J. Anderson, M. A. Kenny, S. M. Bauer, M. J. Schulte, K. Compton, "ERCBench: An Open-Source Benchmark Suite for Embedded and Reconfigurable Computing," In Proceedings of the International Conference on Field Programmable Logic and Applications, pp.408-413, Milano, Italy, Sep. 2010.
  17. N. Goswami, R. Shankar, M. Joshi, T. Li, "Exploring GPGPU Workloads: Characterization Methodology, Analysis and Microarchitecture Evaluation Implications," In Proceedings of the IEEE International Symposium on Workload Characterization, pp.1-10, Georgia, USA, Dec. 2010.
  18. A. Bakhoda, J. Kim, T. M. Aamodt, "Throughput-Effective On-Chip Networks for Manycore Accelerators," In Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture, pp.421-432, Georgia, USA, Dec. 2010.
  19. Samsung 512Mbit GDDR3 SDRAM, http://www.samsung.com/global/system/business/semiconductor/product/2008/5/22/841580ds_k4j52324qh_rev10.pdf.
  20. Booksim interconnection network simulator, http://nocs.stanford.edu/booksim.html.

Cited by

  1. An Effective Parallel Implementation of Guitar Sound Synthesis Using a GPU, vol.18, pp.8, 2012, https://doi.org/10.9708/jksci.2013.18.8.001