A Block-Latency-Based Warp Scheduling Technique for Improving GPGPU Resource Utilization

A Novel Cooperative Warp and Thread Block Scheduling Technique for Improving the GPGPU Resource Utilization

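  • 최용 (School of Electronics and Computer Engineering, Chonnam National University)
  • 김종면 (School of Electrical Engineering, University of Ulsan)
  • 김철홍 (School of Electronics and Computer Engineering, Chonnam National University)
  • Received: 2016.11.01
  • Reviewed: 2017.01.24
  • Published: 2017.05.31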


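GPGPUs that employ multithreading can process data at high speed and reduce memory access time by drawing on their internal parallel resources. Programming models such as CUDA and OpenCL enable fast parallel execution of applications through thread-level processing. However, GPGPUs fail to use their internal hardware resources effectively when running general-purpose applications, because the existing warp/thread block schedulers handle instructions with long memory access times inefficiently. To address this problem, this paper proposes a new warp scheduling technique for improving GPGPU resource utilization. The proposed technique separates the warps of a thread block into warps with long memory access times and warps with short memory access times, then schedules the long-latency warps first and the short-latency warps afterward. We also propose a technique that dynamically reduces the number of streaming multiprocessors when high contention occurs in the memory and interconnection network, so that the warp scheduler can be used effectively. Experimental results show that, on a GPGPU platform with 15 streaming multiprocessors, the proposed warp scheduling technique improves performance (IPC) by 7.5% on average over the conventional round-robin warp scheduling technique. When the two proposed techniques are applied together, performance (IPC) improves by 8.9% on average.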
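The long-latency-first ordering described in the abstract can be pictured with a small sketch. This is only an illustration under stated assumptions, not the authors' hardware design: the `Warp` record, the per-warp `est_latency` estimate, and the fixed `threshold` are all hypothetical, since the abstract does not say how latency is measured or where the boundary between the two groups lies.

```python
from dataclasses import dataclass


@dataclass
class Warp:
    warp_id: int
    est_latency: int  # hypothetical memory-latency estimate, in cycles


def order_warps(warps, threshold):
    """Place long-latency warps ahead of short-latency ones.

    The threshold is an assumed cut-off; the original order is kept
    inside each group.
    """
    long_group = [w for w in warps if w.est_latency >= threshold]
    short_group = [w for w in warps if w.est_latency < threshold]
    return long_group + short_group


# Four warps of one thread block with made-up latency estimates.
warps = [Warp(0, 40), Warp(1, 400), Warp(2, 35), Warp(3, 380)]
issue_order = [w.warp_id for w in order_warps(warps, threshold=200)]
print(issue_order)  # [1, 3, 0, 2]
```

A plain round-robin scheduler would issue these warps as 0, 1, 2, 3. Issuing warps 1 and 3 first starts their long memory accesses earlier, so the short-latency warps can run while those accesses are outstanding.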

General-Purpose Graphics Processing Units (GPGPUs) are built on massively parallel architectures and apply multithreading to exploit parallelism. With programming models such as CUDA and OpenCL, GPGPUs have become well suited to exploiting the abundant thread-level parallelism of parallel applications. Unfortunately, modern GPGPUs cannot efficiently utilize their available hardware resources for many general-purpose applications. One of the primary reasons is that existing warp/thread block schedulers are inefficient at hiding long-latency instructions, which results in lost opportunities to improve performance. This paper studies the effects of the hardware thread scheduling policy on GPGPU performance. We propose a novel warp scheduling policy that alleviates the drawbacks of the traditional round-robin policy. The proposed warp scheduler first classifies the warps of a thread block into two groups, warps with long latency and warps with short latency, and then schedules the warps with long latency before the warps with short latency. Furthermore, to support the proposed warp scheduler, we also propose a supplemental technique that dynamically reduces the number of streaming multiprocessors to which thread blocks are assigned when a high degree of contention is encountered in the memory and interconnection network. In our experiments on a GPGPU platform with 15 streaming multiprocessors, the proposed warp scheduling policy provides an average IPC improvement of 7.5% over the baseline round-robin warp scheduling policy. When the two proposed techniques are combined, GPGPU performance improves by approximately 8.9% on average.
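The supplemental technique, reducing the number of streaming multiprocessors that receive thread blocks under high contention, might be sketched as follows. The contention metric (a value between 0 and 1), the `high` threshold, the one-SM step size, and the function name are assumptions made for illustration; the abstract states only that the count is reduced dynamically when contention in the memory and interconnection network is high.

```python
def next_active_sm_count(active_sms, contention, high=0.8, min_sms=1):
    """Return the number of SMs that should receive thread blocks next.

    contention: hypothetical metric in [0, 1], e.g. the fraction of
    cycles in which the memory/interconnect was stalled.
    One SM is dropped per sample while contention exceeds `high`.
    """
    if contention > high and active_sms > min_sms:
        return active_sms - 1
    return active_sms


# Start from the 15-SM configuration mentioned in the abstract and
# feed in made-up contention samples.
active = 15
history = []
for sample in [0.5, 0.9, 0.95, 0.6, 0.85]:
    active = next_active_sm_count(active, sample)
    history.append(active)
print(history)  # [15, 14, 13, 13, 12]
```

This sketch only ever lowers the count, because that is all the abstract describes; whether and how the count is raised again when contention subsides is not stated there.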


