Search | Korea Science

NEST-C: A deep learning compiler framework for heterogeneous computing systems with artificial intelligence accelerators

Jeman Park; Misun Yu; Jinse Kwon; Junmo Park; Jemin Lee; Yongin Kwon
- ETRI Journal, v.46 no.5, pp.851-864, 2024
Deep learning (DL) has significantly advanced artificial intelligence (AI); however, frameworks such as PyTorch, ONNX, and TensorFlow are optimized for general-purpose GPUs, leading to inefficiencies on specialized accelerators such as neural processing units (NPUs) and processing-in-memory (PIM) devices. These accelerators are designed to optimize both throughput and energy efficiency, but they require more tailored optimizations. To address these limitations, we propose the NEST compiler (NEST-C), a novel DL framework that improves the deployment and performance of models across various AI accelerators. NEST-C leverages profiling-based quantization, dynamic graph partitioning, and multi-level intermediate representation (IR) integration for efficient execution on diverse hardware platforms. Our results show that NEST-C significantly enhances computational efficiency and adaptability across various AI accelerators, achieving higher throughput, lower latency, improved resource utilization, and greater model portability. These benefits contribute to more efficient DL model deployment in modern AI applications.
https://doi.org/10.4218/etrij.2024-0139
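
The abstract above mentions profiling-based quantization without describing the scheme. As a rough illustration of the general idea only (not NEST-C's actual method), the sketch below derives unsigned 8-bit affine quantization parameters from a value range observed while running sample batches. All function names and the min/max calibration rule are assumptions made for this example.

```python
def profile_range(batches):
    """Return the (min, max) observed over all profiled batches (hypothetical helper)."""
    lo = min(min(batch) for batch in batches)
    hi = max(max(batch) for batch in batches)
    return lo, hi


def affine_quant_params(observed_min, observed_max, num_bits=8):
    """Derive (scale, zero_point) for unsigned affine quantization from a profiled range."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Widen the range to include 0.0 so that zero is exactly representable.
    lo = min(observed_min, 0.0)
    hi = max(observed_max, 0.0)
    if hi == lo:
        return 1.0, 0  # degenerate range: every profiled value was 0.0
    scale = (hi - lo) / (qmax - qmin)
    zero_point = int(round(qmin - lo / scale))
    return scale, max(qmin, min(qmax, zero_point))


def quantize(x, scale, zero_point, num_bits=8):
    """Map a real value to its integer code, clamping to the representable range."""
    q = int(round(x / scale)) + zero_point
    return max(0, min(2 ** num_bits - 1, q))


def dequantize(q, scale, zero_point):
    """Map an integer code back to the real value it represents."""
    return scale * (q - zero_point)


if __name__ == "__main__":
    lo, hi = profile_range([[-1.0, 0.5], [0.25, 3.0]])
    scale, zero_point = affine_quant_params(lo, hi)
    code = quantize(0.7, scale, zero_point)
    print(code, dequantize(code, scale, zero_point))
```

Production compilers usually offer more robust calibration rules (for example, clipping outliers) than the plain min/max used here.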

Transfer Matrix Algorithm for Computing the Geometric Quantities of a Square Lattice Polymer

Lee, Julian
- Journal of the Korean Physical Society, v.73 no.12, pp.1808-1813, 2018
I develop a transfer matrix algorithm for computing the geometric quantities of a square lattice polymer with nearest-neighbor interactions. The radius of gyration, the end-to-end distance, and the monomer-to-end distance are computed as functions of temperature. The computation time scales as ${\lesssim}1.8^N$ with chain length $N$, in contrast to explicit enumeration, where the scaling is ${\sim}2.7^N$. Various techniques for reducing memory requirements are implemented.
https://doi.org/10.3938/jkps.73.1808
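
For orientation, the explicit-enumeration baseline that the abstract contrasts with can be sketched as follows. This is not the transfer matrix algorithm itself. It assumes an attractive energy of -epsilon per non-bonded nearest-neighbor contact and units with k_B = 1; the function names are chosen for this example. Its cost grows exponentially with chain length, so it is only practical for short chains.

```python
import math

STEPS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def enumerate_conformations(n_monomers):
    """Yield every self-avoiding walk with n_monomers sites, starting at the origin."""
    path = [(0, 0)]
    occupied = {(0, 0)}

    def extend():
        if len(path) == n_monomers:
            yield tuple(path)
            return
        x, y = path[-1]
        for dx, dy in STEPS:
            site = (x + dx, y + dy)
            if site not in occupied:
                path.append(site)
                occupied.add(site)
                yield from extend()
                occupied.remove(site)
                path.pop()

    yield from extend()


def contacts(conf):
    """Count nearest-neighbor pairs that are not adjacent along the chain."""
    index = {site: i for i, site in enumerate(conf)}
    count = 0
    for i, (x, y) in enumerate(conf):
        for dx, dy in ((1, 0), (0, 1)):  # look right and up so each pair is seen once
            j = index.get((x + dx, y + dy))
            if j is not None and abs(i - j) > 1:
                count += 1
    return count


def end_to_end_sq(conf):
    """Squared distance between the first and last monomers."""
    (x0, y0), (x1, y1) = conf[0], conf[-1]
    return (x1 - x0) ** 2 + (y1 - y0) ** 2


def gyration_sq(conf):
    """Squared radius of gyration about the center of mass."""
    n = len(conf)
    cx = sum(x for x, _ in conf) / n
    cy = sum(y for _, y in conf) / n
    return sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in conf) / n


def thermal_average(n_monomers, temperature, observable, epsilon=1.0):
    """Boltzmann average of observable with E = -epsilon * contacts (assumed model)."""
    partition = total = 0.0
    for conf in enumerate_conformations(n_monomers):
        weight = math.exp(epsilon * contacts(conf) / temperature)
        partition += weight
        total += weight * observable(conf)
    return total / partition


if __name__ == "__main__":
    for t in (0.5, 1.0, 5.0):
        print(t, thermal_average(8, t, gyration_sq))
```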

Performance Evaluation and Analysis of Multiple Scenarios of Big Data Stream Computing on Storm Platform

Sun, Dawei; Yan, Hongbin; Gao, Shang; Zhou, Zhangbing
- KSII Transactions on Internet and Information Systems (TIIS), v.12 no.7, pp.2977-2997, 2018
In the big data era, fresh data grows rapidly every day. More than 30,000 gigabytes of data are created every second, and the rate is accelerating. Many organizations rely heavily on real-time streaming, and big data stream computing helps them spot opportunities and risks in real-time big data. Storm, one of the most common online stream computing platforms, has been used for big data stream computing, with response times ranging from milliseconds to sub-seconds. The performance of Storm plays a crucial role in different application scenarios; however, few studies have evaluated it. In this paper, we investigate the performance of Storm under different application scenarios. Our experimental results show that the throughput and latency of Storm are greatly affected by the number of instances of each vertex in the task topology and by the amount of available resources in the data center. The fault-tolerance mechanism of Storm works well in most big data stream computing environments. Based on these results, we suggest that a dynamic topology, an elastic scheduling framework, and a memory-based fault-tolerance mechanism are necessary for providing high-throughput, low-latency services on the Storm platform.
https://doi.org/10.3837/tiis.2018.07.002

Limiting CPU Frequency Scaling Considering Main Memory Accesses (주메모리 접근을 고려한 CPU 주파수 조정 제한)

Park, Moonju
- KIISE Transactions on Computing Practices, v.20 no.9, pp.483-491, 2014
Contemporary computer systems exploit DVFS (Dynamic Voltage/Frequency Scaling) technology to balance performance and power consumption. The efficiency of DVFS depends on how much performance is gained in exchange for the larger power consumption caused by an elevated CPU frequency. For memory-bound applications in particular, a higher CPU frequency often does not result in higher performance. In this paper, we present an upper bound on CPU frequency scaling based on memory accesses. Experiments show that the performance gain from a higher CPU frequency is limited by the number of memory accesses (last-level cache misses) per instruction. Using these results, we present a CPU frequency upper bound above which little performance is gained. Experimental results show that, for a memory-bound application, applying the frequency upper bound improves the energy efficiency of the application by more than 30%.
https://doi.org/10.5626/KTCP.2014.20.9.483
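
The abstract does not give the paper's formula for the bound. The toy model below only illustrates why last-level cache misses per instruction cap the benefit of a higher clock: it assumes that the time per instruction is a core part that scales with 1/f plus a memory-stall part that does not depend on f. The model, its default parameters (`core_cpi`, `mem_latency_s`), and the 5% tolerance are all assumptions made for this sketch.

```python
def time_per_instruction(freq_hz, misses_per_instr, core_cpi=1.0, mem_latency_s=80e-9):
    """Assumed model: core cycles scale with 1/f, memory stalls are frequency-independent."""
    return core_cpi / freq_hz + misses_per_instr * mem_latency_s


def frequency_upper_bound(freqs_hz, misses_per_instr, tolerance=0.05, **model):
    """Return the lowest available frequency that is within `tolerance` of the
    performance reached at the highest available frequency."""
    freqs = sorted(freqs_hz)
    best = time_per_instruction(freqs[-1], misses_per_instr, **model)
    for f in freqs:
        if time_per_instruction(f, misses_per_instr, **model) <= (1 + tolerance) * best:
            return f
    return freqs[-1]


if __name__ == "__main__":
    available = [1.0e9, 2.0e9, 3.0e9]
    for mpi in (0.0, 0.05, 0.2):
        print(mpi, frequency_upper_bound(available, mpi))
```

Under this model, a compute-bound workload (no misses) keeps the highest frequency, while a workload with many misses per instruction is capped at a lower one because the extra clock speed barely shortens its run time.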

Implementation of External Memory Expansion Device for Large Image Processing (대규모 영상처리를 위한 외장 메모리 확장장치의 구현)

Choi, Yongseok; Lee, Hyejin
- Journal of Broadcast Engineering, v.23 no.5, pp.606-613, 2018
This study implements an external memory expansion device for large-scale image processing. The device consists of an external memory adapter card with a PCI (Peripheral Component Interconnect) Express Gen3 x8 interface, mounted in a graphics workstation for image processing, and an external memory board with external DDR (Double Data Rate) memory. The memory adapter card and the external memory board are connected through an optical interface. Both programmed I/O and DMA (Direct Memory Access) can be used to access the external memory, so that image data can be transmitted and received efficiently. We implemented the device using boards equipped with an Altera Stratix V FPGA (Field Programmable Gate Array) and a 40G optical transceiver; the test results show a bandwidth of 1.6 GB/s, which can handle one channel of 4K UHD (Ultra High Definition) images. Future work will aim at a bandwidth of 3 GB/s or more.
https://doi.org/10.5909/JBE.2018.23.5.606

A Virtualized Kernel for Effective Memory Test (효과적인 메모리 테스트를 위한 가상화 저널)

Park, Hee-Kwon; Youn, Dea-Seok; Choi, Jong-Moo
- Journal of KIISE:Computer Systems and Theory, v.34 no.12, pp.618-629, 2007
In this paper, we propose an effective memory test environment, called a virtualized kernel, for 64-bit multi-core computing environments. Here, effectiveness means that all of the physical memory space, even the space occupied by the kernel itself, can be tested without rebooting. To obtain this capability, our virtualized kernel provides four mechanisms. The first is direct access to physical memory in both kernel and user mode, which allows various test patterns to be applied to any location in physical memory. The second is kernel virtualization, so that two or more kernel images can run at different locations in physical memory. The third is isolation of the memory space used by different instances of the virtualized kernel. The fourth is kernel hibernation, which enables context switching between kernels. We have implemented the proposed virtualized kernel by modifying the latest Linux kernel, 2.6.18, running on an Intel Xeon system that has two 64-bit dual-core CPUs with hyper-threading technology and 2 GB of main memory. Experimental results show that two instances of the virtualized kernel run at different locations in physical memory and that kernel hibernation works as designed. As a result, every location in physical memory can be tested without rebooting.

A High Performance Flash Memory Solid State Disk (고성능 플래시 메모리 솔리드 스테이트 디스크)

Yoon, Jin-Hyuk; Nam, Eyee-Hyun; Seong, Yoon-Jae; Kim, Hong-Seok; Min, Sang-Lyul; Cho, Yoo-Kun
- Journal of KIISE:Computing Practices and Letters, v.14 no.4, pp.378-388, 2008
Flash memory has been attracting attention as the next mass storage medium for mobile computing systems such as notebook computers and Ultra Mobile PCs (UMPCs) due to its low power consumption, high shock and vibration resistance, and small size. A storage system based on flash memory excels at random reads, sequential reads, and sequential writes. However, it falls short on random writes because flash memory physically cannot overwrite data unless the data is first erased. To overcome this shortcoming, we propose an SSD (Solid State Disk) architecture with two novel features. First, we use non-volatile FRAM (Ferroelectric RAM) in conjunction with NAND flash memory, combining FRAM's fast access speed and ability to overwrite with NAND flash memory's low price. Second, the architecture categorizes host write requests into small random writes and large sequential writes, and processes them with two different buffer management schemes, each optimized for its type of write request. This scheme has been implemented in an SSD prototype and evaluated with a standard PC environment benchmark. The results show that our architecture outperforms a conventional HDD and other commercial SSDs by more than three times in throughput for random access workloads.

Hybrid Transactional Memory using Sampling-based Retry Policy in Multi-Core Environment (멀티코어 환경에서 샘플링 기반 재시도 정책을 이용한 하이브리드 트랜잭셔널 메모리)

Kang, Moon-Hwan; Jang, Yeon-Woo; Yoon, Min; Chang, Jae-Woo
- The Journal of Korean Institute of Next Generation Computing, v.13 no.2, pp.49-61, 2017
Transactional Memory (TM) has greatly changed the parallel programming paradigm for transaction processing and is classified into software TM (STM), hardware TM (HTM), and hybrid TM (HyTM) according to whether it is implemented in software, hardware, or both. However, existing studies have the problem that they provide a static retry policy for all workloads. To solve this problem, we propose a hybrid transactional memory scheme using a sampling-based adaptive retry policy in a multi-core environment. First, the proposed scheme determines whether to use STM or HTM according to the characteristics of a transaction. Otherwise, it executes HTM and STM concurrently by using a Bloom filter. Second, the proposed scheme provides an adaptive retry policy for HTM according to the characteristics of the transactions in each workload. Finally, in an experimental performance evaluation using STAMP, the proposed scheme shows 10% to 20% better performance than the existing schemes.
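
The abstract names a Bloom filter but does not say how the scheme uses it. In transactional memory, a Bloom filter is commonly used as a compact summary of the addresses a transaction has touched; it can report false positives but never false negatives. The sketch below shows only the data structure. The class name, sizes, and hashing choice are assumptions for illustration, not details from the paper.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter over hashable keys (for example, memory addresses)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting BLAKE2b with the hash index.
        data = repr(key).encode()
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(data, salt=seed.to_bytes(8, "little")).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, key):
        """Record key by setting all of its bit positions."""
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        """False means definitely absent; True means possibly present."""
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


if __name__ == "__main__":
    write_set = BloomFilter()
    for address in (0x1000, 0x1040):
        write_set.add(address)
    print(write_set.might_contain(0x1000), write_set.might_contain(0x9999))
```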

Efficient Data Pre-fetching Scheme for InfiniBand based High Performance Clusters (인피니밴드 기반 고성능 클러스터를 위한 효율적인 데이터 선반입 기법)

Kim, Bongjae; Jung, Jinman; Min, Hong; Heo, Junyoung; Jung, Hyedong
- KIISE Transactions on Computing Practices, v.23 no.5, pp.293-298, 2017
Recently, much research has been devoted to implementing and provisioning high-performance computing environments using clusters of multiple computers and high-performance networking technologies. In-memory key-value stores, such as Redis and Memcached, are widely used in high-performance cluster environments to improve data processing performance. With these stores, data can be distributed across different storage nodes, and each computing node can access it at high speed. InfiniBand is a de facto standard technology that is widely used to interconnect the nodes of a cluster. In this paper, we propose a new data pre-fetching scheme for key-value stores in high-performance clusters to improve performance. The proposed scheme exploits the data transfer characteristics of InfiniBand. Simulation results show that the proposed scheme can reduce the data transfer time by up to about 28%.
https://doi.org/10.5626/KTCP.2017.23.5.293

Design and Implementation of an Efficient FTL for Large Block Flash Memory using Improved Hybrid Mapping (향상된 혼합 사상기법을 이용한 효율적인 대블록 플래시 메모리 변환계층 설계 및 구현)

Park, Dong-Joo; Kwak, Kyoung-Hoon
- Journal of KIISE:Computing Practices and Letters, v.15 no.1, pp.1-13, 2009
Flash memory is widely used as a storage medium in mobile devices such as MP3 players, cellular phones, and digital cameras due to its small size, low power consumption, and shock resistance. Because of these strengths, many studies currently aim to replace HDDs with flash memory. To use flash memory as a storage medium, an FTL (Flash Translation Layer) is required, since flash memory has an erase-before-write constraint and the read/write unit and the erase unit differ in size. Recently, a new type of flash memory called "large block flash memory" has been introduced. Large block flash memory has a different physical structure and different characteristics from previous flash memory, so existing FTLs do not operate efficiently on it. In this paper, we propose an efficient FTL for large block flash memory based on the FAST (Fully Associative Sector Translation) scheme and page-level mapping on data blocks.

