• Title/Summary/Keyword: deep learning compiler

Search results: 4

Trends of Compiler Development for AI Processor (인공지능 프로세서 컴파일러 개발 동향)

  • Kim, J.K.; Kim, H.J.; Cho, Y.C.P.; Kim, H.M.; Lyuh, C.G.; Han, J.; Kwon, Y.
    • Electronics and Telecommunications Trends / v.36 no.2 / pp.32-42 / 2021
  • The rapid growth of deep-learning applications has spurred the research and development of artificial intelligence (AI) processors. A dedicated software framework, such as a compiler and runtime APIs, is required to achieve maximum processor performance. There are various compilers and frameworks for AI training and inference. In this study, we present the features and characteristics of AI compilers, training frameworks, and inference engines. In addition, we focus on the internals of compiler frameworks, which are based on either basic linear algebra subprograms (BLAS) or an intermediate representation (IR). For in-depth insight, we present the compiler infrastructure, internal components, and operation flow of ETRI's "AI-Ware." The software framework's significant role is evidenced by the optimized neural processing unit (NPU) code that the compiler produces after various optimization passes, such as scheduling, architecture-aware optimization, schedule selection, and power optimization. We conclude the study with thoughts on the future of state-of-the-art AI compilers.
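The pass-based flow this abstract describes (a sequence of optimization passes transforming an IR into accelerator code) can be pictured with a minimal sketch. The toy `Graph` IR, the fusion pass, and the pass-manager loop below are hypothetical stand-ins for illustration, not ETRI AI-Ware's actual infrastructure:

```python
# Minimal sketch of a pass-based compiler pipeline, assuming a toy linear IR.
# All names (Graph, fuse_elementwise, PIPELINE) are illustrative only.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Graph:
    """Toy IR: a list of (op_name, attrs) nodes in execution order."""
    nodes: List[tuple] = field(default_factory=list)

Pass = Callable[[Graph], Graph]

def fuse_elementwise(g: Graph) -> Graph:
    """Fuse adjacent elementwise ops into one kernel
    (a stand-in for architecture-aware optimization)."""
    elementwise = {"add", "relu"}
    fused, i = [], 0
    while i < len(g.nodes):
        name, _ = g.nodes[i]
        if name in elementwise and i + 1 < len(g.nodes) and g.nodes[i + 1][0] in elementwise:
            fused.append((f"fused_{name}_{g.nodes[i + 1][0]}", {}))
            i += 2
        else:
            fused.append(g.nodes[i])
            i += 1
    return Graph(fused)

def schedule(g: Graph) -> Graph:
    """Trivial scheduler: the toy IR is already in a legal linear order."""
    return g

PIPELINE: List[Pass] = [fuse_elementwise, schedule]

def compile_graph(g: Graph) -> Graph:
    """Run each optimization pass in order, like a compiler pass manager."""
    for opt_pass in PIPELINE:
        g = opt_pass(g)
    return g

if __name__ == "__main__":
    g = Graph([("conv2d", {}), ("add", {}), ("relu", {}), ("matmul", {})])
    print(compile_graph(g).nodes)  # the add/relu pair comes out fused
```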

NEST-C: A deep learning compiler framework for heterogeneous computing systems with artificial intelligence accelerators

  • Jeman Park; Misun Yu; Jinse Kwon; Junmo Park; Jemin Lee; Yongin Kwon
    • ETRI Journal / v.46 no.5 / pp.851-864 / 2024
  • Deep learning (DL) has significantly advanced artificial intelligence (AI); however, frameworks such as PyTorch, ONNX, and TensorFlow are optimized for general-purpose GPUs, leading to inefficiencies on specialized accelerators such as neural processing units (NPUs) and processing-in-memory (PIM) devices. These accelerators are designed to optimize both throughput and energy efficiency, but they require more tailored optimizations. To address these limitations, we propose the NEST compiler (NEST-C), a novel DL compiler framework that improves the deployment and performance of models across various AI accelerators. NEST-C leverages profiling-based quantization, dynamic graph partitioning, and multi-level intermediate representation (IR) integration for efficient execution on diverse hardware platforms. Our results show that NEST-C significantly enhances computational efficiency and adaptability across various AI accelerators, achieving higher throughput, lower latency, improved resource utilization, and greater model portability. These benefits contribute to more efficient DL model deployment in modern AI applications.
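As a rough illustration of one technique the abstract names, profiling-based quantization, the sketch below calibrates a tensor's value range on sample data and derives affine quantization parameters from it. Every function name here is a hypothetical stand-in; this is not NEST-C's implementation:

```python
# Minimal sketch of profiling-based (post-training) quantization, assuming
# 8-bit affine quantization. Illustrative only, not NEST-C's actual API.
import numpy as np

def profile_range(calibration_batches):
    """Run calibration data through and record the global min/max."""
    lo, hi = float("inf"), float("-inf")
    for batch in calibration_batches:
        lo = min(lo, float(batch.min()))
        hi = max(hi, float(batch.max()))
    return lo, hi

def quantize_params(lo, hi, num_bits=8):
    """Derive affine scale/zero-point from the profiled range."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (hi - lo) / (qmax - qmin) or 1.0  # guard against a zero range
    zero_point = int(round(qmin - lo / scale))
    return scale, int(np.clip(zero_point, qmin, qmax))

def quantize(x, scale, zero_point, num_bits=8):
    """Map float values to unsigned integers with the derived parameters."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, 0, 2 ** num_bits - 1).astype(np.uint8)

if __name__ == "__main__":
    calib = [np.random.randn(4, 16).astype(np.float32) for _ in range(8)]
    lo, hi = profile_range(calib)
    scale, zp = quantize_params(lo, hi)
    print(quantize(calib[0], scale, zp)[0])
```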

PartitionTuner: An operator scheduler for deep-learning compilers supporting multiple heterogeneous processing units

  • Misun Yu; Yongin Kwon; Jemin Lee; Jeman Park; Junmo Park; Taeho Kim
    • ETRI Journal / v.45 no.2 / pp.318-328 / 2023
  • Recently, embedded systems such as mobile platforms have multiple processing units (PUs) that can operate in parallel, such as central processing units (CPUs) and neural processing units (NPUs). We can use deep-learning compilers to generate machine code optimized for these embedded systems from a deep neural network (DNN). However, the deep-learning compilers proposed so far generate code that sequentially executes DNN operators on a single PU, or parallel code only for graphics processing units (GPUs). In this study, we propose PartitionTuner, an operator scheduler for deep-learning compilers that supports multiple heterogeneous PUs, including CPUs and NPUs. PartitionTuner can generate an operator-scheduling plan that uses all available PUs simultaneously to minimize overall DNN inference time. Operator scheduling is based on an analysis of the DNN architecture and on performance profiles of individual and grouped operators measured on the heterogeneous PUs. In experiments on seven DNNs, PartitionTuner generates scheduling plans that perform 5.03% better than a static type-based operator-scheduling technique for SqueezeNet. In addition, PartitionTuner outperforms recent profiling-based operator-scheduling techniques for ResNet50, ResNet18, and SqueezeNet by 7.18%, 5.36%, and 2.73%, respectively.
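The profile-driven scheduling idea can be sketched as follows. The greedy earliest-finish heuristic, the latency numbers, and the assumption that the operators are independent (e.g., parallel branches of a DNN) are all simplifications for illustration, not PartitionTuner's actual algorithm:

```python
# Minimal sketch of profile-driven operator scheduling across heterogeneous
# PUs. PROFILE holds assumed per-operator latencies (ms) as a profiler might
# report them; the greedy heuristic below is illustrative only.
PROFILE = {
    "conv1": {"cpu": 9.0, "npu": 2.0},
    "conv2": {"cpu": 8.5, "npu": 1.8},
    "pool":  {"cpu": 1.0, "npu": 1.2},
    "fc":    {"cpu": 3.0, "npu": 0.9},
}

def schedule(ops, pus=("cpu", "npu")):
    """Greedily assign each operator to the PU that finishes it earliest,
    tracking when each PU next becomes free. Operators are treated as
    independent here; a real scheduler must respect data dependencies."""
    ready = {pu: 0.0 for pu in pus}
    plan = []
    for op in ops:
        best = min(pus, key=lambda pu: ready[pu] + PROFILE[op][pu])
        start = ready[best]
        ready[best] = start + PROFILE[op][best]
        plan.append((op, best, start, ready[best]))
    return plan

if __name__ == "__main__":
    for op, pu, start, end in schedule(["conv1", "conv2", "pool", "fc"]):
        print(f"{op:6s} -> {pu}  [{start:.1f}, {end:.1f}] ms")
```

Note how the cheap `pool` operator lands on the CPU while the convolutions queue on the NPU, which is the kind of simultaneous-use plan the abstract describes.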

Annotation-guided Code Partitioning Compiler for Homomorphic Encryption Program (지시문을 활용한 동형암호 프로그램 코드 분할 컴파일러)

  • Dongkwan Kim; Yongwoo Lee; Seonyoung Cheon; Heelim Choi; Jaeho Lee; Hoyun Youm; Hanjun Kim
    • The Transactions of the Korea Information Processing Society / v.13 no.7 / pp.291-298 / 2024
  • Despite its wide adoption, cloud computing raises privacy concerns because users must send their private data to the cloud. Homomorphic encryption (HE) can resolve these concerns by allowing cloud servers to compute on encrypted data without decryption. However, due to the huge computation overhead of HE, naively executing an entire cloud program under HE incurs significant computation cost. Manually partitioning the program and applying HE only to the cloud-side partition can reduce this overhead, but manual code partitioning and HE transformation are time-consuming and error-prone. This work proposes Heapa, a new annotation-guided code-partitioning compiler with HE support for privacy-preserving cloud computing. Heapa lets programmers annotate the code regions of a program intended for cloud execution. Heapa then analyzes the annotated program, makes a partitioning plan with a list of the variables that require communication and encryption, and generates HE-enabled partitioned programs. Moreover, Heapa provides not only two region-level partitioning annotations but also two instruction-level annotations, enabling fine-grained partitioning and better performance. For six machine learning and deep learning applications, Heapa achieves a geometric-mean speedup of 3.61× over the non-partitioned cloud computing scheme.
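A minimal sketch of annotation-guided partitioning in the spirit this abstract describes: a Python decorator stands in for the paper's region-level directives, and the "encryption" is a no-op placeholder (a real system would use an HE library such as Microsoft SEAL). All names here are hypothetical, not Heapa's actual interface:

```python
# Minimal sketch of annotation-guided code partitioning. The decorator marks
# a region for cloud execution; the registry plays the role of the compiler's
# partition plan. Encryption/decryption are placeholders, not real HE.
CLOUD_REGIONS = []

def cloud_region(fn):
    """Annotation marking a function to be offloaded and run under HE."""
    CLOUD_REGIONS.append(fn.__name__)
    def wrapper(*args):
        enc_args = [("enc", a) for a in args]    # stand-in for HE encryption
        result = fn(*(a for _, a in enc_args))   # "server-side" compute
        return result                            # stand-in for HE decryption
    return wrapper

@cloud_region
def score(x, w, b):
    # In the real system this region would run in the cloud over ciphertexts.
    return sum(xi * wi for xi, wi in zip(x, w)) + b

if __name__ == "__main__":
    print("cloud regions:", CLOUD_REGIONS)          # the partition plan
    print("score:", score([1.0, 2.0], [0.5, -0.25], 0.1))
```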