NEST-C: A deep learning compiler framework for heterogeneous computing systems with artificial intelligence accelerators

  • Jeman Park (Artificial Intelligence Computing Research Laboratory, Electronics and Telecommunications Research Institute) ;
  • Misun Yu (Artificial Intelligence Computing Research Laboratory, Electronics and Telecommunications Research Institute) ;
  • Jinse Kwon (Artificial Intelligence Computing Research Laboratory, Electronics and Telecommunications Research Institute) ;
  • Junmo Park (Samsung Electronics) ;
  • Jemin Lee (Artificial Intelligence Computing Research Laboratory, Electronics and Telecommunications Research Institute) ;
  • Yongin Kwon (Artificial Intelligence Computing Research Laboratory, Electronics and Telecommunications Research Institute)
  • Received : 2024.03.24
  • Accepted : 2024.08.13
  • Published : 2024.10.10

Abstract

Deep learning (DL) has significantly advanced artificial intelligence (AI); however, frameworks such as PyTorch, ONNX, and TensorFlow are optimized for general-purpose GPUs, leading to inefficiencies on specialized accelerators such as neural processing units (NPUs) and processing-in-memory (PIM) devices. These accelerators are designed to optimize both throughput and energy efficiency, but they require optimizations tailored to their architectures. To address these limitations, we propose the NEST compiler (NEST-C), a novel DL compiler framework that improves the deployment and performance of models across various AI accelerators. NEST-C leverages profiling-based quantization, dynamic graph partitioning, and multi-level intermediate representation (IR) integration for efficient execution on diverse hardware platforms. Our results show that NEST-C significantly enhances computational efficiency and adaptability across various AI accelerators, achieving higher throughput, lower latency, improved resource utilization, and greater model portability. These benefits contribute to more efficient deployment of DL models in modern AI applications.
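To illustrate the profiling-based quantization idea mentioned in the abstract, the following is a minimal sketch of calibration-driven post-training quantization in the general form such compilers use: profile per-tensor value ranges on calibration data, then derive int8 scale and zero-point parameters. All names here (run_model, profile_ranges, int8_params) are hypothetical and do not reflect NEST-C's actual API; this is not the paper's implementation.

```python
import numpy as np

def profile_ranges(run_model, calibration_batches):
    """Run the model on calibration data and record per-tensor (min, max) ranges.
    run_model is assumed to return a dict mapping tensor names to numpy arrays."""
    ranges = {}
    for batch in calibration_batches:
        for name, tensor in run_model(batch).items():
            lo, hi = float(tensor.min()), float(tensor.max())
            if name in ranges:
                ranges[name] = (min(ranges[name][0], lo), max(ranges[name][1], hi))
            else:
                ranges[name] = (lo, hi)
    return ranges

def int8_params(lo, hi):
    """Derive an asymmetric int8 scale and zero-point from a profiled range."""
    qmin, qmax = -128, 127
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # keep zero exactly representable
    scale = (hi - lo) / (qmax - qmin)
    zero_point = int(round(qmin - lo / scale)) if scale > 0 else 0
    return scale, zero_point

def quantize(tensor, scale, zero_point):
    """Map a float tensor to int8 using the profiled parameters."""
    q = np.round(tensor / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)
```

In practice, a compiler would attach the derived (scale, zero_point) pairs to the corresponding nodes of its intermediate representation before lowering to accelerator-specific code; the sketch above only shows the range-profiling and parameter-derivation step.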

Keywords

Acknowledgement

This study was supported by a grant from the Institute of Information & Communications Technology Planning & Evaluation (IITP), funded by the Korean government (MSIT) (No. RS-2023-00277060, Development of OpenEdge AI SoC hardware and software platform).
