A Study on GPGPU Performance Improvement Technique on GCN Architecture Using OpenCL API

  • Woo, DongHee (Graduate School of Computer Science, Sangmyung University)
  • Kim, YoonHo (Department of Computer Science, Sangmyung University)
  • Received : 2017.12.20
  • Accepted : 2018.02.20
  • Published : 2018.02.28

Abstract

The systems on which programs run have steadily expanded from conventional single-core and multi-core designs to many-core, coprocessor-assisted, and heterogeneous environments. However, existing research has focused mostly on parallelizing programs for NVIDIA architectures with the CUDA framework, and only rarely on optimization for GCN, AMD's general-purpose GPU architecture. To address this gap, our study examines performance improvement techniques for OpenCL, the GPGPU environment of the GCN architecture, and demonstrates a measurable performance gain. Specifically, applying the proposed techniques to GPGPU programs for matrix multiplication and convolution reduced their computation time by more than 30% in the best case and increased kernel throughput by more than 40%.
