참고문헌
- S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and Wen-Mei W. Hwu, "Optimization Principles and Application Performance Evaluation of a Multithreaded GPU using CUDA," in Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, pp.73-82, 2008.
- NVIDIA. "CUDA C Programming Guide," 2012.
- M. Garland et al., "Parallel Computing Experiences with CUDA," MICRO, IEEE, Vol.28, No.4, 2008.
- A. Munshi, "The OpenCL Specification," Version 1.2, Khronos OpenCL Working Group, 2011.
- O. Kayiran, A. Jog, M. T. Kandemir, and C. R. Das, "Neither More Nor Less: Optimizing Thread-Level Parallelism for GPGPUs," in CSE Penn State Tech Report, TR-CES- 2212-006, 2012.
- V. Narasiman, M. Shebanow, C. J. Lee, R. Miftakhutdinov, O. Mutlu, and Y. N. Patt, "Improving GPU Performance via Large Warps and Two-Level Warp Scheduling," in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, ACM, pp.308-317, 2011.
- A. Bakhola, G. Yuan, W. W. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA Workloads Using a Detailed GPU simulator," in Proceedings of the 2009 International Symposium on Analysis of Systems and Software (ISPASS-2009), pp. 163-174, Apr. 2009.
- M. Lee et.al, "Improving GPGPU Resource Utilization through Alternative Thread Block Scheduling," in Proceedings of the 20th International Symposium on High Performance Computer Architecture (HPCA), IEEE, pp.260-271, 2014.
- V. V. P. Harish and P. J. Narayanan, "Large graph algorithms for massively multithreaded architectures," in Technical report, IIIT, 2009
- J. Owens, M. Houston, D. Luebke, S. Green, J. Stone, and J. Phillips, "GPU Computing," in Proceedings of the IEEE, Vol.96, No.5, pp.879-899.
- S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W.-m. W. Hwu, "Optimization Principles and Aapplication Performance Evaluation of a Multithreaded GPU Using CUDA," in Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ACM, pp.73-82, 2008.
- V. Volkov and J. W. Demmel, "Benchmarking GPUs to Tune Dense Linear Algebra," in Proceedings of the ACM/IEEE Conference on Supercomputing, pp.1-11, 2008.
- M. Gebhart, R. D. Johnson, D. Tarjan, S. W. Keckler, W. J. Dally, E. Lindoholm, and K. Skadron, "Energy-efficient Mechanisms for Managing Thread Context in Throughput Processors," in Proceedings of the 38th Annual International Symposium on Computer Architecture (ISCA), pp.235-246, 2011.
- T. G. Rogers, M. O'Connor, and T. M. Aamodt, "Cache- Conscious Wavefront Scheduling," in Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp.72-83, 2013.
- W. W. L. Fung, I. Sham, G. Yuan, and T. M. Aamodt, "Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow," in Proceedings of the 40th Annual IEEE/ ACM International Symposium on Microarchitecture, IEEE Computer Society, pp.407-420, 2007.
- J. Lee, N. B. Lakshminarayana, H. Kim, and R. Vuduc, "Many-Thread Aware Prefetching Mechanisms for GPGPU Applications," in Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), IEEE Computer Society, pp.213-224, 2010.
- A. Jog et al., "Orchestrated Scheduling and Prefetching for GPGPUs," in Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA), pp.332-343, Tel-Aviv, Israel, 2013.
- Z. Guz, E. Bolotin, I. Keidar, A. Kolodny, A. Mendelson, and U. Weiser, "Many-Core vs. Many-Thread Machines: Stay Away From the Valley," Computer Architecture Letters, Vol.8, No.1, pp.25-28, 2009. https://doi.org/10.1109/L-CA.2009.4
- H.-Y. Cheng, C.-H. Lin, J. Li, and C.-L. Yang, "Memory Latency Reduction via Thread Throttling," in Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp.53-64, 2010.
- K. M. Abdalla et al., "Scheduling and Execution of Compute Tasks," US Patent US20130185725, 2013.
- J. D. Owens et al., "A Survey of Genera-Purpose Computation on Graphics Hardware," in Eurographics 2005, State of the Art Reports, pp.21-51, Aug., 2005.
- W. W. L. Fung and T. M. Aamodt, "Thread Block Compaction for Efficient SIMT Control Flow," in Proceedings of the 17th International Symposium on High Performance Computer Architecture (HPCA), IEEE, pp.356-367, 2011.
- K. Krewell, "AMD's Fusion Finally Arrives," Microprocessor Report, 2011.
- K. Krewell, "NVIDIA Lowers the Heat on Kepler," Microprocessor Report, 2012.
- NVIDIA, Whitepaper: NVIDIA's Next Generation CUDA Compute and Graphics Architecture: Fermi.
- NVIDIA, "NVIDA Tegra Multiprocessor Architecture," Feb. 2010.
- J. Chen et al., "Guided Region-Based GPU Scheduling: Utilizing Multi-thread Parallelism to Hide Memory Latency," in Proceedings of the IEEE 27th International Symposium on Parallel and Distributed Processing, pp.441-451, 2013.
- D. Kirk, "NVIDIA CUDA Software and GPU Parallel Computing Architecture," in ISMM, pp.103-104, 2007.
- NVIDA, CUDA SDK [Internet], http://developer.nvidia.com/gpu-computing-sdk.
- A. Jog et al., "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," in Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp.395-406, 2013.
- S.-Y. Lee, A. Arunkumar, and C.-J. Wu, "CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration of GPGPU Workloads," in Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA), pp.515-527, 2015.
- W. Jia, K. Shaw, and M. Martonosi, "MRPB: Memory Request Prioritization for Massively Parallel Processors," in Proceedings of the 20th International Symposium on High Performance Computer Architecture (HPCA), IEEE, pp.272- 283, 2014.
- X. Xie et al., "Coordinated Static and Dynamic Cache Bypassing for GPUs," in Proceedings of the 21st International Symposium on High Performance Computer Architecture (HPCA), IEEE, pp.76-88, 2015.