References
- S. Che, J. Li, J. W. Sheaffer, K. Skadron, and J. Lach, "Accelerating compute-intensive applications with GPUs and FPGAs," in Proceedings of Symposium on Application Specific Processors (SASP 2008), Anaheim, CA, 2008, pp. 101-107.
- J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips, "GPU computing," Proceedings of the IEEE, vol. 96, no. 5, pp. 879-899, 2008. https://doi.org/10.1109/JPROC.2008.917757
- J. Shotton, T. Sharp, A. Kipman, A. Fitzgibbon, M. Finocchio, A. Blake, M. Cook, and R. Moore, "Real-time human pose recognition in parts from single depth images," Communications of the ACM, vol. 56, no. 1, pp. 116-124, 2013. https://doi.org/10.1145/2398356.2398381
- NVIDIA Tegra mobile processors, http://www.nvidia.com/object/tegra.html.
- NVIDIA DRIVE PX2, http://www.nvidia.com/object/drivepx.html.
- NVIDIA CUDA Toolkit Documentation v7.0, https://developer.nvidia.com/cuda-toolkit.
- J. E. Stone, D. Gohara, and G. Shi, "OpenCL: a parallel programming standard for heterogeneous computing systems," Computing in Science & Engineering, vol. 12, no. 3, pp. 66-73, 2010.
- M. Alt, C. Ferdinand, F. Martin, and R. Wilhelm, "Cache behavior prediction by abstract interpretation," Static Analysis, Lecture Notes in Computer Science vol. 1145, Heidelberg: Springer, 1996, pp. 52-66.
- NVIDIA CUDA Parallel Thread Execution ISA version 4.2, http://www.nvidia.com.
- A. Bakhoda, G. L. Yuan, W. W. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA workloads using a detailed GPU simulator," in Proceedings of IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2009), Boston, MA, 2009, pp. 163-174.
- S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S. H. Lee, and K. Skadron, "Rodinia: a benchmark suite for heterogeneous computing," in Proceedings of IEEE International Symposium on Workload Characterization (IISWC 2009), Austin, TX, 2009, pp. 44-54.
- D. A. Alcantara, A. Sharf, F. Abbasinejad, S. Sengupta, M. Mitzenmacher, J. D. Owens, and N. Amenta, "Real-time parallel hashing on the GPU," ACM Transactions on Graphics (TOG), vol. 28, no. 5, article no. 154, 2009.
- U. Verner, A. Schuster, and M. Silberstein, "Processing data streams with hard real-time constraints on heterogeneous systems," in Proceedings of the International Conference on Supercomputing, Tucson, AZ, 2011, pp. 120-129.
- B. Andersson, G. Raravi, and K. Bletsas, "Assigning real-time tasks on heterogeneous multiprocessors with two unrelated types of processors," in Proceedings of 2010 IEEE 31st Real-Time Systems Symposium (RTSS), San Diego, CA, 2010.
- G. A. Elliott and J. H. Anderson, "Globally scheduled real-time multiprocessor systems with GPUs," Real-Time Systems, vol. 48, no. 1, pp. 34-74, 2012. https://doi.org/10.1007/s11241-011-9140-y
- G. Elliott, B. Ward, and J. Anderson, "GPUSync: architecture-aware management of GPUs for predictable multi-GPU real-time systems," in Proceedings of 34th IEEE RTSS, Vancouver, Canada, 2013, pp. 33-44.
- X. Vera, B. Lisper, and J. Xue, "Data cache locking for higher program predictability," ACM SIGMETRICS Performance Evaluation Review, vol. 31, no. 1, pp. 272-282, 2003.
- V. Suhendra and T. Mitra, "Exploring locking & partitioning for predictable shared caches on multi-cores," in Proceedings of the 45th annual Design Automation Conference, Anaheim, CA, 2008, pp. 300-303.
- H. Ding, Y. Liang, and T. Mitra, "WCET-centric partial instruction cache locking," in Proceedings of 2012 49th ACM/EDAC/IEEE Design Automation Conference (DAC), San Francisco, CA, 2012, pp. 412-420.
- R. Banakar, S. Steinke, B. S. Lee, M. Balakrishnan, and P. Marwedel, "Scratchpad memory: design alternative for cache on-chip memory in embedded systems," in Proceedings of the 10th International Symposium on Hardware/Software Codesign, Estes Park, CO, 2002, pp. 73-78.
- M. Schoeberl, "A time predictable instruction cache for a Java processor," in On the Move to Meaningful Internet Systems: OTM 2004 Workshops, Heidelberg: Springer, 2004, pp. 371-382.
- D. Hardy and I. Puaut, "WCET analysis of multi-level non-inclusive set-associative instruction caches," in Proceedings of Real-Time Systems Symposium (RTSS), Barcelona, Spain, 2008, pp. 456-466.
- Y. Yan and W. Zhang, "WCET analysis for multi-core processors with shared L2 instruction caches," in Proceedings of Real-Time and Embedded Technology and Applications Symposium (RTAS'08), St. Louis, MO, 2008, pp. 80-89.
- Y. Li, V. Suhendra, Y. Liang, T. Mitra, and A. Roychoudhury, "Timing analysis of concurrent programs running on shared cache multi-cores," in Proceedings of Real-Time Systems Symposium (RTSS), Washington, DC, 2009, pp. 57-67.
- B. K. Huynh, L. Ju, and A. Roychoudhury, "Scope-aware data cache analysis for WCET estimation," in Proceedings of 2011 17th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), Chicago, IL, 2011, pp. 203-212.
- X. Xie, Y. Liang, G. Sun, and D. Chen, "An efficient compiler framework for cache bypassing on GPUs," in Proceedings of 2013 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), San Jose, CA, 2013, pp. 516-523.
- W. Jia, K. A. Shaw, and M. Martonosi, "MRPB: memory request prioritization for massively parallel processors," in Proceedings of 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), Orlando, FL, 2014, pp. 272-283.
- A. Betts and A. Donaldson, "Estimating the WCET of GPU-accelerated applications using hybrid analysis," in Proceedings of 2013 25th Euromicro Conference on Real-Time Systems (ECRTS), Paris, France, 2013, pp. 193-202.
- K. Berezovskyi, L. Santinelli, K. Bletsas, and E. Tovar, "WCET measurement-based and extreme value theory characterisation of CUDA kernels," in Proceedings of the 22nd International Conference on Real-Time Networks and Systems, Versailles, France, 2014, p. 279.