• Title/Summary/Keyword: memory data layout


Memory data layout and DMA transfer technique research For efficient data transfer of CNN accelerator (CNN 가속기의 효율적인 데이터 전송을 위한 메모리 데이터 레이아웃 및 DMA 전송기법 연구)

  • Cho, Seok-Jae; Park, Sungkyung; Park, Chester Sungchung
    • Journal of IKEEE, v.24 no.2, pp.559-569, 2020
  • CNN, one of the deep learning algorithms, is used in artificial intelligence applications that store convolution-layer data in off-chip memory. DMA can reduce the processor load incurred at every data transfer, and application performance degradation can also be reduced by varying the order in which convolution-layer data is transferred to the accelerator's global buffer. For the basic layout with contiguous memory addresses, SG-DMA showed about a 3.4-times improvement in DMA preset time over ordinary DMA, whereas for the ideal layout with non-contiguous memory addresses, ordinary DMA was about 1,396 cycles faster than SG-DMA. Experiments showed that combining the memory data layout with the appropriate DMA scheme can reduce the DMA preset load by about 86 percent.
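
The trade-off described in this abstract comes down to how much descriptor setup a given layout forces on the DMA engine. The following C sketch is only an illustration of that idea, with a hypothetical descriptor structure and tile shape rather than the paper's actual accelerator interface: a strided (non-contiguous) tile needs one scatter-gather descriptor per row, while a layout rearranged to be contiguous needs a single descriptor.

    /* Illustrative sketch (not the paper's hardware model): build a
     * scatter-gather (SG) descriptor chain for a strided convolution tile
     * and compare the preset work against a contiguous layout.           */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct {            /* one SG descriptor: where to read, how much */
        const void *src;
        size_t      len;
    } sg_desc;

    /* One descriptor per row of a (rows x cols) tile cut out of a feature map
     * whose row pitch is `pitch` elements. A non-contiguous layout needs
     * `rows` descriptors; a contiguous layout needs only one.              */
    static size_t build_tile_descs(const float *base, size_t rows, size_t cols,
                                   size_t pitch, sg_desc *out)
    {
        for (size_t r = 0; r < rows; ++r) {
            out[r].src = base + r * pitch;
            out[r].len = cols * sizeof(float);
        }
        return rows;                      /* number of descriptors to preset */
    }

    int main(void)
    {
        enum { ROWS = 16, COLS = 16, PITCH = 256 };   /* hypothetical tile shape */
        float *fmap = calloc(ROWS * PITCH, sizeof(float));
        sg_desc descs[ROWS];

        size_t n_sg   = build_tile_descs(fmap, ROWS, COLS, PITCH, descs);
        size_t n_flat = 1;   /* a layout rearranged to be contiguous needs one */

        printf("SG descriptors (strided layout):   %zu\n", n_sg);
        printf("Descriptors (contiguous layout):   %zu\n", n_flat);
        free(fmap);
        return 0;
    }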

A Technique for Improving the Performance of Cache Memories

  • Cho, Doosan
    • International Journal of Internet, Broadcasting and Communication, v.13 no.3, pp.104-108, 2021
  • To improve performance in IoT and edge computing systems, memory is usually configured in a hierarchical structure. Based on the distance from the CPU, access speed slows down in the order of registers, cache memory, main memory, and storage, and energy consumption likewise increases as the distance from the CPU grows. It is therefore important to develop techniques that place frequently used data in the upper levels of the memory hierarchy as much as possible to improve both performance and energy consumption. However, such a technique must also address the cache performance degradation caused by the lack of spatial locality that occurs when the data access stride is large. This study proposes a technique that selectively places data with a large access stride into a software-controlled cache. With the proposed technique, spatial locality is improved by reducing the data access interval, and consequently cache performance is improved.
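
As a rough illustration of the placement idea (a minimal sketch with made-up sizes, not the paper's software-controlled cache), the gather loop below packs elements that are accessed with a large stride into a small contiguous buffer, so that the repeated passes that follow enjoy spatial locality:

    /* Minimal sketch: copy data accessed with a large stride into a small
     * contiguous buffer that stands in for a software-controlled cache /
     * scratchpad, so later accesses have spatial locality.               */
    #include <stdio.h>
    #include <stdlib.h>

    #define STRIDE 1024          /* large access stride */
    #define N      256           /* number of strided elements actually used */

    int main(void)
    {
        double *big = malloc((size_t)N * STRIDE * sizeof(double));
        double scratch[N];       /* contiguous "software cache" copy */

        for (size_t i = 0; i < (size_t)N * STRIDE; ++i) big[i] = (double)i;

        /* Gather phase: one strided pass packs the hot elements contiguously. */
        for (size_t i = 0; i < N; ++i)
            scratch[i] = big[i * STRIDE];

        /* Compute phase: repeated passes now touch contiguous memory. */
        double sum = 0.0;
        for (int pass = 0; pass < 100; ++pass)
            for (size_t i = 0; i < N; ++i)
                sum += scratch[i];

        printf("sum = %f\n", sum);
        free(big);
        return 0;
    }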

An Optimal ILP Algorithm of Memory Access Variable Storage for DSP in Embedded System (임베디드 시스템에서 DSP를 위한 메모리 접근 변수 저장의 최적화 ILP 알고리즘)

  • Chang, Jeong-Uk; Lin, Chi-Ho
    • KIPS Transactions on Computer and Communication Systems, v.2 no.2, pp.59-66, 2013
  • In this paper, we propose an optimal ILP algorithm for memory address code generation for DSPs in embedded systems. Using 0-1 ILP formulations, the memory data layout of variables is optimized for the DSP's address generation units. We identify feasible memory assignments of variables from the constraint conditions and register the address code that each variable points to in the program pointer. When the processing sequence of the program is declared to the program pointer, the auto-increment/decrement mode is applied to the address code of the relevant variable, and the number of loads to the address registers is minimized to optimize the data layout of the variables. To demonstrate the effectiveness of the proposed algorithm, the FICO Xpress-MP modeling tool was applied to benchmarks. The results show that, compared with the usual declaration-order layout, the optimal memory layout produced by the proposed algorithm reduces the number of loads to the address/modify registers and the number of accesses to the address code, thereby reducing program execution time.
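
The quantity the 0-1 ILP minimizes can be pictured with a toy cost model: given a memory layout and the program's variable-access sequence, count how many accesses fall outside the AGU's auto-increment/decrement range and therefore need an explicit address-register load. The C sketch below uses a hypothetical four-variable example, not the paper's formulation, and simply compares a declaration-order layout with a hand-picked alternative:

    /* Toy cost model behind memory-layout optimization for an AGU (an
     * illustration, not the paper's 0-1 ILP): count the explicit address-
     * register (AR) loads a layout causes for a given access sequence.
     * An access is free when its address is adjacent to the previous one,
     * because the auto-increment/decrement mode covers it.              */
    #include <stdio.h>

    static int layout_cost(const int *addr_of, const int *access_seq, int n)
    {
        int loads = 1;                       /* the first access always loads the AR */
        for (int i = 1; i < n; ++i) {
            int delta = addr_of[access_seq[i]] - addr_of[access_seq[i - 1]];
            if (delta != 1 && delta != -1)   /* not reachable by auto-in/decrement */
                ++loads;
        }
        return loads;
    }

    int main(void)
    {
        /* Hypothetical program: variables a=0, b=1, c=2, d=3 accessed as
         * a,c,a,c,b,d,b,d.                                               */
        int access_seq[] = { 0, 2, 0, 2, 1, 3, 1, 3 };
        int n = sizeof access_seq / sizeof access_seq[0];

        int decl_order[] = { 0, 1, 2, 3 };   /* addresses in declaration order */
        int opt_order[]  = { 0, 2, 1, 3 };   /* a=0, b=2, c=1, d=3             */

        printf("AR loads, declaration order: %d\n", layout_cost(decl_order, access_seq, n));
        printf("AR loads, optimized order:   %d\n", layout_cost(opt_order, access_seq, n));
        return 0;
    }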

Technology of the next generation low power memory system

  • Cho, Doosan
    • International Journal of Internet, Broadcasting and Communication, v.10 no.4, pp.6-11, 2018
  • As embedded memory technology evolves, the traditional Static Random Access Memory (SRAM) technology has reached the end of its development. As manufacturing process technology continues to shrink, a next-generation memory technology is strongly required because of the exponentially increasing leakage current of SRAM. Non-volatile memories such as STT-MRAM (Spin Torque Transfer Magnetic Random Access Memory) and PCM (Phase Change Memory) are good candidates for replacing SRAM in embedded memory systems, offering advantages in power consumption, leakage power, size (density), and latency. Nonetheless, non-volatile memories have two major problems that hinder their use as next-generation memory. First, the lifetime of a non-volatile memory cell is limited by the number of write operations. Second, a write operation costs more latency and power than a read operation of the same size. These disadvantages can be mitigated by the compiler: since the weakness of non-volatile memory lies in write operations, the compiler can, when deciding the data layout, allocate write-intensive data to SRAM. This study provides insights into how such compiler and architectural designs can be developed.
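
A compiler pass of the kind sketched in this abstract boils down to ranking data objects by how often they are written and spending the scarce SRAM on the most write-intensive ones. The C sketch below is only an assumed illustration; the profile numbers, object names, and greedy policy are invented, not taken from the paper:

    /* Minimal sketch: given profiled per-object write counts, place the most
     * write-intensive objects into a limited SRAM budget and the rest into
     * non-volatile memory (NVM), since NVM writes are slow, power-hungry,
     * and wear the cells out.                                             */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { const char *name; size_t size; long writes; } obj_t;

    static int by_write_density(const void *a, const void *b)
    {
        const obj_t *x = a, *y = b;
        double dx = (double)x->writes / x->size, dy = (double)y->writes / y->size;
        return (dx < dy) - (dx > dy);          /* descending writes-per-byte */
    }

    int main(void)
    {
        obj_t objs[] = {                       /* hypothetical profile data */
            { "frame_buf",  4096, 90000 },
            { "lut_table",  8192,     3 },
            { "scratch",    1024, 50000 },
            { "weights",   16384,     0 },
        };
        size_t n = sizeof objs / sizeof objs[0], sram_left = 6144;

        qsort(objs, n, sizeof objs[0], by_write_density);
        for (size_t i = 0; i < n; ++i) {
            int in_sram = objs[i].size <= sram_left;
            if (in_sram) sram_left -= objs[i].size;
            printf("%-10s -> %s\n", objs[i].name, in_sram ? "SRAM" : "NVM");
        }
        return 0;
    }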

Large-scale 3D fast Fourier transform computation on a GPU

  • Jaehong Lee; Duksu Kim
    • ETRI Journal, v.45 no.6, pp.1035-1045, 2023
  • We propose a novel graphics processing unit (GPU) algorithm that can handle a large-scale 3D fast Fourier transform (i.e., 3D-FFT) problem whose data size is larger than the GPU's memory. A 1D FFT-based 3D-FFT computational approach is used to solve the limited device memory issue. Moreover, to reduce the communication overhead between the CPU and GPU, we propose a 3D data-transposition method that converts the target 1D vector into a contiguous memory layout and improves data transfer efficiency. The transposed data are communicated between the host and device memories efficiently through the pinned buffer and multiple streams. We apply our method to various large-scale benchmarks and compare its performance with the state-of-the-art multicore CPU FFT library (i.e., fastest Fourier transform in the West [FFTW]) and a prior GPU-based 3D-FFT algorithm. Our method achieves higher performance (up to 2.89 times) than FFTW, and the performance gap widens as the data size increases. The performance of the prior GPU algorithm decreases considerably on massive-scale problems, whereas our method's performance remains stable.
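
The core of the proposed transposition step is to turn the strided 1D lines of the 3D volume into contiguous runs before they are staged through pinned buffers. The small C sketch below shows that index transformation for lines along the z axis; the array shapes and naming are assumptions for illustration, not the paper's code:

    /* Sketch: transpose a 3D array so that 1D lines along the z axis become
     * contiguous before they are shipped to the GPU for batched 1D FFTs.  */
    #include <stdio.h>
    #include <stdlib.h>

    /* in  is laid out as in[z][y][x] (x fastest); a z-line has stride nx*ny.
     * out is laid out as out[y][x][z], so each z-line is contiguous.       */
    static void transpose_zlines(const float *in, float *out,
                                 int nx, int ny, int nz)
    {
        for (int z = 0; z < nz; ++z)
            for (int y = 0; y < ny; ++y)
                for (int x = 0; x < nx; ++x)
                    out[((size_t)y * nx + x) * nz + z] =
                        in[((size_t)z * ny + y) * nx + x];
    }

    int main(void)
    {
        enum { NX = 4, NY = 4, NZ = 4 };
        float *a = malloc(sizeof(float) * NX * NY * NZ);
        float *b = malloc(sizeof(float) * NX * NY * NZ);
        for (int i = 0; i < NX * NY * NZ; ++i) a[i] = (float)i;

        transpose_zlines(a, b, NX, NY, NZ);

        /* The z-line at (x=1, y=2) is now the contiguous run b[(2*NX+1)*NZ ..]. */
        for (int z = 0; z < NZ; ++z) printf("%g ", b[(2 * NX + 1) * NZ + z]);
        printf("\n");
        free(a); free(b);
        return 0;
    }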

High-Performance FFT Using Data Reorganization (데이터 재구성 기법을 이용한 고성능 FFT)

  • Park Neungsoo; Choi Yungho
    • The KIPS Transactions: Part A, v.12A no.3 s.93, pp.215-222, 2005
  • The efficient utilization of cache memories is a key factor in achieving high performance when computing large signal transforms. Non-unit-stride access in the computation of large DFTs causes cache conflict misses, resulting in poor cache performance and a severe degradation of overall performance. In this paper, we propose a dynamic data layout approach that takes the memory hierarchy into account: data reorganization is performed between computation stages to reduce the number of cache misses. We also develop an efficient search algorithm that determines the factorization tree with the minimum execution time among the possible trees, considering the DFT sizes and the data access stride. Our approach is applied to the fast Fourier transform (FFT). Experiments were performed on the Pentium 4, Athlon 64, Alpha 21264, and UltraSPARC III. The results show that our FFT achieves up to 3.37 times better performance than previous FFT packages.
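
The data reorganization between stages is essentially a cache-friendly transpose, so that the next computation stage reads with unit stride instead of a large, conflict-prone stride. The following C sketch shows a blocked transpose of a complex 2D view of the data; the sizes and block edge are illustrative assumptions, not the paper's tuned values:

    /* Sketch: a blocked transpose used as the data-reorganization step
     * between FFT stages, so the next stage reads with unit stride.     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <complex.h>

    #define BLK 16                       /* tile edge chosen to fit in L1 cache */

    static void blocked_transpose(const double complex *in, double complex *out,
                                  int rows, int cols)
    {
        for (int ib = 0; ib < rows; ib += BLK)
            for (int jb = 0; jb < cols; jb += BLK)
                for (int i = ib; i < ib + BLK && i < rows; ++i)
                    for (int j = jb; j < jb + BLK && j < cols; ++j)
                        out[(size_t)j * rows + i] = in[(size_t)i * cols + j];
    }

    int main(void)
    {
        int rows = 256, cols = 128;      /* e.g. an N = 256*128 DFT viewed as 2D */
        double complex *a = malloc(sizeof *a * rows * cols);
        double complex *b = malloc(sizeof *b * rows * cols);
        for (int i = 0; i < rows * cols; ++i) a[i] = i;

        blocked_transpose(a, b, rows, cols);
        printf("a[1*cols+2] = %.0f, b[2*rows+1] = %.0f\n",
               creal(a[1 * cols + 2]), creal(b[2 * rows + 1]));
        free(a); free(b);
        return 0;
    }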

Effect of ASLR on Memory Duplicate Ratio in Cache-based Virtual Machine Live Migration

  • Piao, Guangyong; Oh, Youngsup; Sung, Baegjae; Park, Chanik
    • IEMEK Journal of Embedded Systems and Applications, v.9 no.4, pp.205-210, 2014
  • Cache-based live migration uses a cache that is accessible to both sides (remote and local) to reduce virtual machine migration time by transferring only non-redundant data. However, address space layout randomization (ASLR) is known to reduce the memory duplicate ratio between the memory being migrated and the migration cache. In this paper, we analyze the behavior of ASLR to find out how it changes the physical memory contents of virtual machines. We found that, among the six virtual memory regions, only modifications to the stack influence the page-level memory duplicate ratio. Experiments showed that: (1) ASLR does not shift the heap region at sub-page granularity; (2) with ASLR enabled, the stack reduces the duplicate page size among VMs that performed input replay by around 40 MB; and (3) the size of the memory pages that can be reconstructed from a freshly booted state is also reduced by about 60 MB by ASLR. With these observations, when applying cache-based migration, the stack region can be omitted, while for the other five regions even a coarse page-level redundancy detection method can identify most of the duplicate memory contents.
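
The "coarse page-level redundancy detection" mentioned above can be pictured as hashing fixed-size pages of two memory images and counting how many pages of one image already exist in the other. The C sketch below is an assumed stand-in for illustration; the page size, hash, and synthetic images are not from the paper:

    /* Sketch: a coarse page-level duplicate detector that hashes 4 KB pages
     * of two memory images and reports what fraction of image A's pages
     * already exist somewhere in image B.                                 */
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <stdlib.h>

    #define PAGE 4096

    static uint64_t fnv1a(const unsigned char *p, size_t n)
    {
        uint64_t h = 1469598103934665603ULL;
        for (size_t i = 0; i < n; ++i) { h ^= p[i]; h *= 1099511628211ULL; }
        return h;
    }

    static double duplicate_ratio(const unsigned char *a, const unsigned char *b,
                                  size_t bytes)
    {
        size_t pages = bytes / PAGE, hits = 0;
        uint64_t *bh = malloc(pages * sizeof *bh);
        for (size_t i = 0; i < pages; ++i) bh[i] = fnv1a(b + i * PAGE, PAGE);
        for (size_t i = 0; i < pages; ++i) {
            uint64_t h = fnv1a(a + i * PAGE, PAGE);
            for (size_t j = 0; j < pages; ++j)          /* linear scan for brevity */
                if (bh[j] == h && !memcmp(a + i * PAGE, b + j * PAGE, PAGE)) { ++hits; break; }
        }
        free(bh);
        return (double)hits / pages;
    }

    int main(void)
    {
        enum { NPAGES = 64, N = NPAGES * PAGE };
        unsigned char *vm_a = malloc(N), *vm_b = malloc(N);
        for (size_t i = 0; i < NPAGES; ++i) {            /* per-page fill pattern */
            memset(vm_a + i * PAGE, (int)i, PAGE);
            memset(vm_b + i * PAGE, (int)i, PAGE);
        }
        memset(vm_a + 10 * PAGE, 0xAB, 16 * PAGE);       /* make 16 pages differ */
        printf("duplicate ratio: %.2f\n", duplicate_ratio(vm_a, vm_b, N));
        free(vm_a); free(vm_b);
        return 0;
    }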

Improvement of Address Pointer Assignment in DSP Code Generation (DSP용 코드 생성에서 주소 포인터 할당 성능 향상 기법)

  • Lee, Hee-Jin; Lee, Jong-Yeol
    • Journal of the Institute of Electronics Engineers of Korea CI, v.45 no.1, pp.37-47, 2008
  • Exploiting the address generation units typically provided in DSPs plays an important role in DSP code generation, since they perform fast address computation in parallel with the central data path. Offset assignment is the optimization of the memory layout of program variables that takes advantage of these address generation units; it consists of a memory layout generation step and an address pointer assignment step. In this paper, we propose an effective address pointer assignment method that minimizes the number of address calculation instructions in DSP code generation. The proposed approach reduces the time complexity of a conventional address pointer assignment algorithm with fixed memory layouts by using minimum-cost node breaking, and a powerful pruning technique is employed to reduce memory size and processing time. Moreover, because the memory layout affects the result of the address pointer assignment algorithm, our approach iteratively improves the initial solution by changing the memory layout in each iteration. We applied the proposed approach to about 3,000 access sequences from the OffsetStone benchmarks to demonstrate its effectiveness. Experimental results show an average improvement of 25.9% in the address codes over previous work.
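
To make the role of address pointer assignment concrete, the C sketch below serves a fixed access sequence with K address registers, using an auto-increment/decrement step whenever some register is already adjacent to the target address and counting an explicit load otherwise. The greedy policy and the example sequence are illustrative stand-ins, not the paper's algorithm:

    /* Toy address pointer assignment: count explicit address-register (AR)
     * loads when a register within +/-1 of the target can be stepped for free. */
    #include <stdio.h>

    #define K 2                                  /* number of address registers */

    static int assign_pointers(const int *addr_seq, int n)
    {
        int ar[K], loads = 0;
        for (int r = 0; r < K; ++r) ar[r] = -1000;      /* ARs point "nowhere" yet */

        for (int i = 0; i < n; ++i) {
            int target = addr_seq[i], chosen = -1;
            for (int r = 0; r < K; ++r) {               /* free if |delta| <= 1 */
                int delta = target - ar[r];
                if (delta >= -1 && delta <= 1) { chosen = r; break; }
            }
            if (chosen < 0) { chosen = i % K; ++loads; }    /* explicit AR load */
            ar[chosen] = target;
        }
        return loads;
    }

    int main(void)
    {
        /* Hypothetical access sequence, already mapped to memory addresses. */
        int addr_seq[] = { 0, 1, 7, 2, 8, 1, 7, 0, 6 };
        int n = sizeof addr_seq / sizeof addr_seq[0];
        printf("explicit AR loads with %d address registers: %d\n",
               K, assign_pointers(addr_seq, n));
        return 0;
    }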

A Study on the Automatic Design of Content Addressable Memory (연상 메모리의 자동설계에 관한 연구)

  • 김종선; 백인천; 박노경; 차균현
    • The Journal of Korean Institute of Communications and Information Sciences, v.15 no.10, pp.857-867, 1990
  • Since the CAM structure is as regular as that of a RAM or PLA, a CAM generator program is easy to implement. The program outputs the CAM layout in CIF (Caltech Intermediate Format), and a graphic display program for debugging and displaying the generator output is implemented on a PC/AT in the C language with the MS C (5.0) graphics library. A CIF parser is written with YACC (Yet Another Compiler Compiler) and LEX (Lexical Analyzer) in order to flatten the CIF data, and a program for plotting the layout output on a ROLAND XY plotter is also developed. By combining these programs, everything from CIF generation to layout plotting can be executed from a pull-down menu according to the user's options.
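
For a flavor of what such a generator emits, the C sketch below writes a small symbol made of a grid of boxes in CIF. The layer name and cell geometry are placeholders for illustration; they are not the paper's CAM cell library:

    /* Minimal sketch of CIF emission by a cell generator: one symbol
     * containing a rows x cols grid of boxes on a placeholder layer.  */
    #include <stdio.h>

    static void emit_cam_array(FILE *out, int rows, int cols,
                               int cell_w, int cell_h)
    {
        fprintf(out, "DS 1 1 1;\n");                 /* definition start, symbol 1 */
        fprintf(out, "L CMF;\n");                    /* placeholder layer name */
        for (int r = 0; r < rows; ++r)
            for (int c = 0; c < cols; ++c)
                /* B length width center-x center-y; */
                fprintf(out, "B %d %d %d %d;\n", cell_w, cell_h,
                        c * cell_w + cell_w / 2, r * cell_h + cell_h / 2);
        fprintf(out, "DF;\n");                       /* definition finish */
        fprintf(out, "C 1;\nE\n");                   /* instantiate symbol 1, end */
    }

    int main(void)
    {
        emit_cam_array(stdout, 4, 8, 20, 12);        /* a 4 x 8 array of cells */
        return 0;
    }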


KUIC_LED II : Development of IC Layout Editor using Overlay (KUIC_LED II : Overlay 방법을 사용한 집적회로 Layout Editor의 개발)

  • Jeong, Gab-Jung; Lee, Won; Kweon, Gyu-Wan; Jong-Hoon, Kang; Dae-Hwan, Kim; Ho-Sun, Chung; Wu-Il, Lee
    • Proceedings of the KIEE Conference, 1988.07a, pp.549-552, 1988
  • KUIC_LED II is a two-dimensional graphics editor for IC mask layout that breaks through the memory limit (a maximum of 2,333 boxes) by using overlays. It runs on an IBM PC/AT with the Ω/PC color graphics board. The I/O data format is CIF (Caltech Intermediate Form). It is written in the C language on MS-DOS.
