PartitionTuner: An operator scheduler for deep-learning compilers supporting multiple heterogeneous processing units

  • Misun Yu (Artificial Intelligence Research Laboratory, Electronics and Telecommunications Research Institute) ;
  • Yongin Kwon (Artificial Intelligence Research Laboratory, Electronics and Telecommunications Research Institute) ;
  • Jemin Lee (Artificial Intelligence Research Laboratory, Electronics and Telecommunications Research Institute) ;
  • Jeman Park (Artificial Intelligence Research Laboratory, Electronics and Telecommunications Research Institute) ;
  • Junmo Park (Artificial Intelligence Research Laboratory, Electronics and Telecommunications Research Institute) ;
  • Taeho Kim (Artificial Intelligence Research Laboratory, Electronics and Telecommunications Research Institute)
  • Received : 2021.11.22
  • Accepted : 2022.10.25
  • Published : 2023.04.20

Abstract

Recently, embedded systems, such as mobile platforms, have multiple processing units that can operate in parallel, such as central processing units (CPUs) and neural processing units (NPUs). We can use deep-learning compilers to generate machine code optimized for these embedded systems from a deep neural network (DNN). However, the deep-learning compilers proposed so far generate code that executes DNN operators sequentially on a single processing unit, or parallel code only for graphics processing units (GPUs). In this study, we propose PartitionTuner, an operator scheduler for deep-learning compilers that supports multiple heterogeneous processing units (PUs), including CPUs and NPUs. PartitionTuner can generate an operator-scheduling plan that uses all available PUs simultaneously to minimize overall DNN inference time. Operator scheduling is based on an analysis of the DNN architecture and on the performance profiles of individual operators and operator groups measured on the heterogeneous PUs. In experiments on seven DNNs, PartitionTuner generated scheduling plans that perform 5.03% better than a static type-based operator-scheduling technique for SqueezeNet. In addition, PartitionTuner outperforms recent profiling-based operator-scheduling techniques on ResNet50, ResNet18, and SqueezeNet by 7.18%, 5.36%, and 2.73%, respectively.
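The sketch below is a minimal, hypothetical illustration of the kind of profile-guided operator scheduling the abstract describes: each operator is placed on whichever PU yields the earliest finish time, given profiled per-PU latencies and an assumed inter-PU transfer penalty. This is a generic HEFT-style list-scheduling heuristic, not PartitionTuner's actual algorithm; the operator names, latency numbers, and transfer cost are invented for illustration.

```python
# Illustrative sketch only (not PartitionTuner's algorithm): HEFT-style list
# scheduling of DNN operators across heterogeneous PUs using profiled latencies.
# All names, latencies, and the transfer penalty below are hypothetical.

TRANSFER_MS = 0.2  # assumed fixed cost when producer and consumer run on different PUs

# Profiled latency (ms) of each operator on each PU (hypothetical numbers).
profile = {
    "conv1": {"cpu": 3.0, "npu": 0.8},
    "conv2": {"cpu": 2.5, "npu": 0.7},
    "pool":  {"cpu": 0.4, "npu": 0.5},
    "fc":    {"cpu": 1.2, "npu": 0.6},
}
# Operator dependency graph, listed in topological order: op -> predecessors.
deps = {"conv1": [], "conv2": ["conv1"], "pool": ["conv1"], "fc": ["conv2", "pool"]}

def schedule(profile, deps):
    pu_free = {pu: 0.0 for pu in next(iter(profile.values()))}  # time each PU becomes idle
    finish, placement = {}, {}
    for op in deps:  # deps is iterated in topological order
        best = None
        for pu, latency in profile[op].items():
            # The operator can start once the PU is idle and all inputs have arrived.
            ready = pu_free[pu]
            for pred in deps[op]:
                arrival = finish[pred] + (TRANSFER_MS if placement[pred] != pu else 0.0)
                ready = max(ready, arrival)
            end = ready + latency
            if best is None or end < best[0]:
                best = (end, pu)
        finish[op], placement[op] = best
        pu_free[best[1]] = best[0]
    return placement, max(finish.values())

if __name__ == "__main__":
    placement, makespan = schedule(profile, deps)
    print(placement)                         # e.g. {'conv1': 'npu', 'conv2': 'npu', 'pool': 'cpu', 'fc': 'npu'}
    print(f"estimated latency: {makespan:.2f} ms")
```

With these invented numbers, the cheap pooling operator migrates to the otherwise idle CPU while the convolutions stay on the NPU, which is the kind of cross-PU parallelism the abstract refers to.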

Keywords

Acknowledgements

This work was supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2018-0-00769: Neuromorphic Computing Software Platform for Artificial Intelligence Systems, and No. 2022-0-00454: Technology development of smart edge device SW development platform).
