Search | Korea Science

XEM: Tensor accelerator for AB21 supercomputing artificial intelligence processor

Won Jeon;Mi Young Lee;Joo Hyun Lee;Chun-Gi Lyuh
- ETRI Journal / v.46 no.5 / pp.839-850 / 2024
As computing systems grow larger, high-performance computing (HPC) is gaining importance. In particular, as hyperscale artificial intelligence (AI) applications, such as large language models, emerge, HPC has become important even in the field of AI. The important operations in hyperscale AI and HPC are mainly linear algebraic operations based on tensors. The AB21 supercomputing AI processor has been proposed to accelerate such applications. This study proposes the XEM accelerator to accelerate linear algebraic operations in an AB21 processor effectively. The XEM accelerator has outer product-based parallel floating-point units that can efficiently process tensor operations. We provide hardware details of the XEM architecture and introduce new instructions for controlling the XEM accelerator. Additionally, hardware characteristic analyses based on chip fabrication and simulator-based functional verification are conducted. In the future, the performance and functionalities of the XEM accelerator will be verified using an AB21 processor.
https://doi.org/10.4218/etrij.2024-0141
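
The XEM abstract above mentions outer product-based floating-point units. As a rough illustration of the general idea only (not the XEM hardware or its instruction set, which the abstract does not detail), a matrix product can be accumulated as a sum of rank-1 outer products, one per index of the shared dimension. The function name `matmul_outer` is a hypothetical name chosen for this sketch.

```python
def matmul_outer(a, b):
    """Compute C = A x B as a sum of rank-1 outer products.

    a is an m-by-k matrix and b is a k-by-n matrix, both as lists of rows.
    Illustrative sketch only; it does not model any specific accelerator.
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for p in range(k):
        col = [a[i][p] for i in range(m)]  # p-th column of A
        row = b[p]                         # p-th row of B
        # Accumulate the outer product col * row into C. Every (i, j)
        # update in this step is independent of the others, which is
        # what makes the formulation attractive for parallel units.
        for i in range(m):
            for j in range(n):
                c[i][j] += col[i] * row[j]
    return c
```

Each pass of the outer loop touches every element of the result exactly once, so a grid of multiply-accumulate units could in principle process one pass in parallel.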

Design of Stand-alone AI Processor for Embedded System (독립운용이 가능한 임베디드 인공지능 프로세서 설계)

Cho, Kwon Neung;Choi, Do Young;Jeong, Young Woo;Lee, Seung Eun
- Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2021.05a / pp.600-602 / 2021
With the development of the mobile industry and growing interest in artificial intelligence (AI) technology, AI processors applicable to embedded systems are being actively researched. When implementing AI in embedded systems, the design must take resource and power-consumption constraints into account. Moreover, it is efficient to include a dedicated hardware accelerator to compensate for the low computational performance of the embedded system. In this paper, we propose a stand-alone embedded AI processor. The proposed AI processor includes a hardware accelerator dedicated to a distance-based AI algorithm and a general-purpose MCU that supports flexible programmability for application to various embedded systems. The AI processor was designed in Verilog HDL and verified by implementation on a field-programmable gate array (FPGA).
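
The abstract above refers to a distance-based AI algorithm without naming it. One simple member of that family is nearest-prototype classification, sketched below under the assumption of Manhattan (L1) distance; the function name `nearest_label` and the data layout are hypothetical, and the paper's actual algorithm may differ.

```python
def nearest_label(sample, prototypes):
    """Return the label of the stored prototype closest to `sample`.

    `prototypes` is a list of (vector, label) pairs. Manhattan distance
    is an assumption made for this sketch; it needs only subtractions,
    absolute values, and additions.
    """
    best_label, best_dist = None, float("inf")
    for vector, label in prototypes:
        dist = sum(abs(x - y) for x, y in zip(sample, vector))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label
```

The distance loop is the part a dedicated accelerator would typically take over, while a general-purpose core handles data movement and application logic.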

Trends of Low-Precision Processing for AI Processor (NPU 반도체를 위한 저정밀도 데이터 타입 개발 동향)

Kim, H.J.;Han, J.H.;Kwon, Y.S.
- Electronics and Telecommunications Trends / v.37 no.1 / pp.53-62 / 2022
With the increasing size of transformer-based neural networks, lightweight algorithms and efficient AI accelerators have been developed to train these huge networks in a practical amount of time. In this article, we present a survey of state-of-the-art research on low-precision computational algorithms, especially for floating-point formats, and their hardware accelerators. We describe the trends by focusing on the work of two leading research groups, IBM and Seoul National University, which have deep knowledge of both AI algorithms and hardware architecture. For the low-precision algorithms, we summarize two efficient floating-point formats (hybrid FP8 and radix-4 FP4) with accuracy-preserving training algorithms from the main research stream. Moreover, we describe the AI processor architecture supporting the low-bit mixed-precision computing unit, including the integer engine.
https://doi.org/10.22648/ETRI.2022.J.370106

AB9: A neural processor for inference acceleration

Cho, Yong Cheol Peter;Chung, Jaehoon;Yang, Jeongmin;Lyuh, Chun-Gi;Kim, HyunMi;Kim, Chan;Ham, Je-seok;Choi, Minseok;Shin, Kyoungseon;Han, Jinho;Kwon, Youngsu
- ETRI Journal / v.42 no.4 / pp.491-504 / 2020
We present AB9, a neural processor for inference acceleration. AB9 consists of a systolic tensor core (STC) neural network accelerator designed to accelerate artificial intelligence applications by exploiting the data reuse and parallelism characteristics inherent in neural networks while providing fast access to large on-chip memory. Complementing the hardware is an intuitive and user-friendly development environment that includes a simulator and an implementation flow that provides a high degree of programmability with a short development time. Along with a 40-TFLOP STC that includes 32k arithmetic units and over 36 MB of on-chip SRAM, our baseline implementation of AB9 consists of a 1-GHz quad-core setup with other various industry-standard peripheral intellectual properties. The acceleration performance and power efficiency were evaluated using YOLOv2, and the results show that AB9 has superior performance and power efficiency to that of a general-purpose graphics processing unit implementation. AB9 has been taped out in the TSMC 28-nm process with a chip size of 17 mm × 23 mm. Delivery is expected later this year.
https://doi.org/10.4218/etrij.2020-0134

Assessment of Radiation Dose from Radioactive Wedge Filters during High-Energy X-Ray Therapy

Back, Geum-mun;Park, Sung Ho;Kim, Tae-Hyung
- Progress in Medical Physics / v.28 no.2 / pp.45-48 / 2017
This paper evaluated the amount of radiation generated by wedge filters during radiation therapy using a high-energy linear accelerator, and the dose to the worker during wedge replacement. After a 10-MV photon beam was delivered through the wedge filter, the wedge was removed from the linear accelerator, and the dose rate and energy spectrum were measured. The initial measurement was approximately 1 μSv/h, and the radiation level fell to 0.3 μSv/h after 6 min. The effective half-life derived from the dose rate measurement was approximately 3.5 min, and the contribution of Al-28 was about 53%. In the energy spectrum measurements, a peak of 1,799 keV was measured for Al-28, while the peak for Co-58 was not measured in the control room. The peaks for Au-106 and Cd-105 were found only when the measurement was made without removing the wedge from the linear accelerator. The additional doses received by the radiation worker during wedge replacement were estimated to be 0.08-0.4 mSv per year.
https://doi.org/10.14316/pmp.2017.28.2.45

Implementation of FPGA-based Accelerator for GRU Inference with Structured Compression (구조적 압축을 통한 FPGA 기반 GRU 추론 가속기 설계)

Chae, Byeong-Cheol
- Journal of the Korea Institute of Information and Communication Engineering / v.26 no.6 / pp.850-858 / 2022
To deploy Gated Recurrent Units (GRUs) on resource-constrained embedded devices, this paper presents a reconfigurable FPGA-based GRU accelerator that enables structured compression. First, a dense GRU model is significantly reduced in size by hybrid quantization and structured top-k pruning. Second, the energy consumed by external memory access is greatly reduced by the proposed reuse computing pattern. Finally, the accelerator can handle a structured sparse model that benefits from the algorithm-hardware co-design workflow. Moreover, inference tasks can be flexibly performed using all functional dimensions, sequence length, and number of layers. Implemented on the Intel DE1-SoC FPGA, the proposed accelerator achieves 45.01 GOPS on a structured sparse GRU network without batching. Compared with CPU and GPU implementations, the low-cost FPGA accelerator achieves 57x and 30x improvements in latency and 300x and 23.44x improvements in energy efficiency, respectively. Thus, the proposed accelerator serves as an early study of real-time embedded applications, demonstrating the potential for further development.
https://doi.org/10.6109/jkiice.2022.26.6.850
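
The GRU abstract above names structured top-k pruning but does not define it. One common reading, assumed here purely for illustration, is that each weight row is split into fixed-size groups and only the k largest-magnitude weights in each group are kept, so that every group has the same number of nonzeros. The function name `topk_prune_row` and its parameters are hypothetical and may not match the paper's scheme.

```python
def topk_prune_row(row, group_size, k):
    """Keep the k largest-magnitude weights in each group of a row.

    Assumed interpretation of "structured top-k pruning": all other
    weights in the group are set to zero, giving a regular sparsity
    pattern that hardware can index cheaply.
    """
    pruned = []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        # Positions of the k largest absolute values within this group.
        order = sorted(range(len(group)), key=lambda i: abs(group[i]), reverse=True)
        keep = set(order[:k])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned
```

With a group size of 4 and k = 2, for example, every group of four weights retains exactly two nonzeros, which keeps the workload balanced across parallel multipliers.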

AI Accelerator Design for Edge Devices (엣지 디바이스를 위한 AI 가속기 설계 방법)

Whoi Ree, Ha;Hyunjun Kim;Yunheung Paek
- Annual Conference of KIPS / 2024.05a / pp.723-726 / 2024
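DNN accelerators that support a single dataflow are resource-efficient, but their acceleration benefit is limited across multiple DNN models. In contrast, a reconfigurable dataflow accelerator (RDA), which supports all dataflows and uses the optimal dataflow for each layer, delivers large speedups but is inefficient because of the additional hardware needed to support multiple dataflows. This study therefore proposes a new accelerator design that supports only a limited set of dataflows to reduce the additional hardware requirements and optimizes the design by reusing redundant hardware. This approach is essential for applying the RDA approach to edge devices with clear resource limits, and it minimizes the drawbacks of existing RDAs to reach an optimal balance between performance and resource efficiency. In experiments, the proposed accelerator showed 32% higher energy efficiency than an existing RDA, with only a 1% difference in latency.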
https://doi.org/10.3745/PKIPS.y2024m05a.723

A Study on Design Space Exploration on AI accelerator (AI 가속기 설계 영역 탐색에 대한 연구)

Lee, Dong-Ju;Paek, Yun-Heung
- Annual Conference of KIPS / 2022.11a / pp.535-537 / 2022
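An AI accelerator is a type of hardware accelerator or computer system designed to speed up the computations of artificial intelligence and machine learning applications, including machine learning and deep learning. Designing an accelerator requires design space exploration, and this paper introduces design space exploration for convolutional neural networks (CNNs) among the various kinds of AI.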
https://doi.org/10.3745/PKIPS.y2022m11a.535

NEST-C: A deep learning compiler framework for heterogeneous computing systems with artificial intelligence accelerators

Jeman Park;Misun Yu;Jinse Kwon;Junmo Park;Jemin Lee;Yongin Kwon
- ETRI Journal / v.46 no.5 / pp.851-864 / 2024
Deep learning (DL) has significantly advanced artificial intelligence (AI); however, frameworks such as PyTorch, ONNX, and TensorFlow are optimized for general-purpose GPUs, leading to inefficiencies on specialized accelerators such as neural processing units (NPUs) and processing-in-memory (PIM) devices. These accelerators are designed to optimize both throughput and energy efficiency but they require more tailored optimizations. To address these limitations, we propose the NEST compiler (NEST-C), a novel DL framework that improves the deployment and performance of models across various AI accelerators. NEST-C leverages profiling-based quantization, dynamic graph partitioning, and multi-level intermediate representation (IR) integration for efficient execution on diverse hardware platforms. Our results show that NEST-C significantly enhances computational efficiency and adaptability across various AI accelerators, achieving higher throughput, lower latency, improved resource utilization, and greater model portability. These benefits contribute to more efficient DL model deployment in modern AI applications.
https://doi.org/10.4218/etrij.2024-0139

Functionality-based Processing-In-Memory Accelerator for Deep Neural Networks (딥뉴럴네트워크를 위한 기능성 기반의 핌 가속기)

Kim, Min-Jae;Kim, Shin-Dug
- Annual Conference of KIPS / 2020.11a / pp.8-11 / 2020
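With the arrival of the Fourth Industrial Revolution and the ongoing convergence of AI and ICT technologies, requests for AI services have become feasible even on user-level devices. AI services related to image processing are used for subject identification, defect inspection, autonomous driving, and more; in particular, deep convolutional neural networks (DCNNs) perform very well at capturing the features of images. However, as images grow larger and networks grow deeper, computation suffers from low data locality and frequent memory accesses. As a result, conventional hierarchical system architectures are limited in processing DCNNs scalably and quickly. This study proposes a Processing-In-Memory (PIM) accelerator with a 3D memory structure for scalable and fast DCNN processing. To this end, we added hardware and software modules to an existing 3D memory, the Hybrid Memory Cube (HMC). Specifically, we built a shared cache and software stack that let processing elements (PEs) share data, a pipelined multiplier, and dual prefetch buffers. Evaluated on the well-known DCNNs LeNet, AlexNet, ZFNet, VGGNet, GoogleNet, and ResNet, the design showed a 40.3% speedup and a 29.4% bandwidth improvement over the baseline HMC.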
https://doi.org/10.3745/PKIPS.y2020m11a.8
