Acknowledgement
This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2018-0-00769: Neuromorphic Computing Software Platform for Artificial Intelligence Systems and No. 2022-0-00454: Technology Development of Smart Edge Device SW Development Platform).
References
- HISILICON, Kirin, 2022. https://www.hisilicon.com/en/products/Kirin
- NVIDIA, Jetson, 2022. https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/
- Samsung, Exynos, 2022. https://semiconductor.samsung.com/processor/mobile-processor/
- T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, H. Shen, M. Cowan, L. Wang, Y. Hu, L. Ceze, and C. Guestrin, TVM: An automated end-to-end optimizing compiler for deep learning, (13th USENIX Symposium on Operating Systems Design and Implementation, Carlsbad, CA, USA), 2018, pp. 578-594.
- S. Cyphers, A. K. Bansal, A. Bhiwandiwalla, J. Bobba, M. Brookhart, A. Chakraborty, W. Constable, C. Convey, L. Cook, O. Kanawi, R. Kimball et al., Intel nGraph: An intermediate representation, compiler, and executor for deep learning, arXiv preprint, 2018. https://doi.org/10.48550/arXiv.1801.08058
- C. Leary and T. Wang, XLA: TensorFlow, compiled, TensorFlow Dev Summit, 2017.
- W.-F. Lin, D.-Y. Tsai, L. Tang, C.-T. Hsieh, C.-Y. Chou, P.-H. Chang, and L. Hsu, ONNC: A compilation framework connecting ONNX to proprietary deep learning accelerators, (IEEE International Conference on Artificial Intelligence Circuits and Systems, Hsinchu, Taiwan), 2019, pp. 214-218.
- N. Rotem, J. Fix, S. Abdulrasool, G. Catron, S. Deng, R. Dzhabarov, N. Gibson, J. Hegeman, M. Lele, R. Levenstein, and J. Montgomery, Glow: Graph lowering compiler techniques for neural networks, arXiv preprint, 2018. https://doi.org/10.48550/arXiv.1805.00907
- M. Zhang, Z. Hu, and M. Li, DUET: A compiler-runtime subgraph scheduling approach for tensor programs on a coupled CPU-GPU architecture, (IEEE International Parallel and Distributed Processing Symposium, Portland, OR, USA), 2021, pp. 151-161.
- ETRI, NEST-C, 2021. https://github.com/etri/nest-compiler
- Y. Ding, L. Zhu, Z. Jia, G. Pekhimenko, and S. Han, IOS: Interoperator scheduler for CNN acceleration, Proc. Machine Learn. Syst. 3 (2021), 167-180.
- L. Ma, Z. Xie, Z. Yang, J. Xue, Y. Miao, W. Cui, W. Hu, F. Yang, L. Zhang, and L. Zhou, RAMMER: Enabling holistic deep learning compiler optimizations with rTasks, (14th USENIX Symposium on Operating Systems Design and Implementation), 2020, pp. 881-897.
- T. Moreau, T. Chen, Z. Jiang, L. Ceze, C. Guestrin, and A. Krishnamurthy, VTA: An open hardware-software stack for deep learning, arXiv preprint, 2018. https://doi.org/10.48550/arXiv.1807.04188
- ONNX, ONNX operators, 2022. https://github.com/onnx/onnx/blob/main/docs/Operators.md
- Y. Xing, S. Liang, L. Sui, X. Jia, J. Qiu, X. Liu, Y. Wang, Y. Shan, and Y. Wang, DNNVM: End-to-end compiler leveraging heterogeneous optimizations on FPGA-based CNN accelerators, IEEE Trans. Comput.-Aided Design Integrated Circ. Syst. 39 (2020), no. 10, 2668-2681. https://doi.org/10.1109/TCAD.2019.2930577
- J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei, ImageNet: A large-scale hierarchical image database, (IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA), 2009, pp. 248-255.
- M. D. Zeiler and R. Fergus, Visualizing and understanding convolutional networks, European Conference on Computer Vision, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars (eds.), Springer, Cham, 2014, pp. 818-833.
- A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet classification with deep convolutional neural networks, Commun. ACM 60 (2017), no. 6, 84-90. https://doi.org/10.1145/3065386
- C. Szegedy, W. Liu, Y. Jia, et al., Going deeper with convolutions, (Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA), 2015, pp. 1-9.
- K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, (Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA), 2016, pp. 770-778.
- S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He, Aggregated residual transformations for deep neural networks, (Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA), 2017, pp. 1492-1500.
- F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size, arXiv preprint, 2016. https://doi.org/10.48550/arXiv.1602.07360
- ONNX, ONNX model zoo, 2022. https://github.com/onnx/models
- N. Vasilache, O. Zinenko, T. Theodoridis, P. Goyal, Z. DeVito, W. S. Moses, S. Verdoolaege, A. Adams, and A. Cohen, Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions, arXiv preprint, 2018. https://doi.org/10.48550/arXiv.1802.04730
- N. P. Jouppi, C. Young, N. Patil, et al., In-datacenter performance analysis of a tensor processing unit, (Proceedings of the 44th Annual International Symposium on Computer Architecture, Association for Computing Machinery, Toronto, Canada), 2017, pp. 1-12.
- Z. Chen, C. H. Yu, T. Morris, J. Tuyls, Y. H. Lai, J. Roesch, E. Delaye, V. Sharma, and Y. Wang, Bring your own codegen to deep learning compiler, arXiv preprint, 2021. https://doi.org/10.48550/arXiv.2105.03215