Algorithmic GPGPU Memory Optimization

Jang, Byunghyun;Choi, Minsu;Kim, Kyung Ki;

doi:10.5573/JSTS.2014.14.4.391

JSTS:Journal of Semiconductor Technology and Science

Volume 14, Issue 4 / Pages 391-406 / 2014 / 1598-1657 (pISSN) / 2233-4866 (eISSN)

The Institute of Electronics and Information Engineers (대한전자공학회)

Jang, Byunghyun (Department of Computer and Information Science, The University of Mississippi, University);
Choi, Minsu (Department of Electrical & Computer Engineering, Missouri University of Science & Technology);
Kim, Kyung Ki (Department of Electrical & Computer Engineering, Daegu University)

Received : 2014.02.01
Accepted : 2014.04.29
Published : 2014.08.30

https://doi.org/10.5573/JSTS.2014.14.4.391

Abstract

The performance of General-Purpose computation on Graphics Processing Units (GPGPU) depends heavily on memory access behavior. This sensitivity stems from the combination of the Massively Parallel Processing (MPP) execution model of GPUs and the lack of architectural support for irregular memory access patterns. Application performance can be significantly improved by applying memory-access-pattern-aware optimizations that exploit the characteristics of each access pattern. In this paper, we present an algorithmic methodology that semi-automatically finds the best mapping of the memory accesses in a serial loop nest onto the underlying data-parallel architecture, based on a comprehensive static memory access pattern analysis. To that end, we present a simple yet powerful mathematical model that captures all memory access pattern information present in serial data-parallel loop nests. We then show how this model is used in practice to select the most appropriate memory space for data and to search a large design space for an appropriate thread mapping and work group size. To evaluate the effectiveness of our methodology, we report execution speedups on selected benchmark kernels that cover a wide range of memory access patterns commonly found in GPGPU workloads. Our experimental results are obtained using the industry-standard heterogeneous programming language OpenCL, targeting the NVIDIA GT200 architecture.
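
As a rough illustration of why thread mapping matters for memory performance, the sketch below compares two candidate mappings of a group of threads onto a row-major matrix and counts how many aligned memory segments each one touches. It is not the model from the paper: the function names, the 16-thread group, and the 64-byte segment size are assumptions chosen for illustration, and real coalescing rules are more detailed.

```python
def segments_touched(addresses, segment_bytes=64):
    """Count distinct aligned memory segments covered by a set of byte addresses.

    Assumption: fewer segments means fewer memory transactions (a simplified
    stand-in for hardware coalescing rules).
    """
    return len({addr // segment_bytes for addr in addresses})


def group_addresses(mapping, width, threads=16, elem_bytes=4, fixed=0):
    """Byte addresses read by consecutive threads from a row-major matrix.

    mapping="column": thread t reads element (fixed, t), a unit stride.
    mapping="row":    thread t reads element (t, fixed), a stride of `width` elements.
    """
    if mapping == "column":
        return [(fixed * width + t) * elem_bytes for t in range(threads)]
    if mapping == "row":
        return [(t * width + fixed) * elem_bytes for t in range(threads)]
    raise ValueError(f"unknown mapping: {mapping}")


def best_mapping(width):
    """Pick the candidate mapping that touches the fewest segments."""
    costs = {
        m: segments_touched(group_addresses(m, width))
        for m in ("column", "row")
    }
    return min(costs, key=costs.get), costs


if __name__ == "__main__":
    choice, costs = best_mapping(width=1024)
    print(choice, costs)
```

For a 1024-element-wide matrix of 4-byte values, the unit-stride mapping keeps all 16 accesses inside one 64-byte segment, while the strided mapping places every access in a different segment. A search over mappings and work group sizes would evaluate a cost of this kind for each candidate.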
