• Title/Summary/Keyword: Memory Reference Instruction


Comparison of Traditional Workloads and Deep Learning Workloads in Memory Read and Write Operations

  • Jeongha Lee;Hyokyung Bahn
    • International journal of advanced smart convergence
    • /
    • v.12 no.4
    • /
    • pp.164-170
    • /
    • 2023
  • With recent advances in AI (artificial intelligence) and HPC (high-performance computing) technologies, deep learning has proliferated across various domains of the 4th industrial revolution. As the volume of deep learning workloads grows, analyzing their memory reference characteristics becomes important. In this article, we analyze the memory reference traces of deep learning workloads in comparison with traditional workloads, focusing specifically on read and write operations. Based on our analysis, we observe some unique characteristics of deep learning memory references that are quite different from those of traditional workloads. First, when comparing instruction and data references, instruction references account for only a small portion of deep learning workloads. Second, when comparing reads and writes, write references account for the majority of memory references, which also differs from traditional workloads. Third, although write references are dominant, they exhibit low reference skewness compared to traditional workloads; specifically, the skew factor of write references is small. We expect that the analysis performed in this article will be helpful in efficiently designing memory management systems for deep learning workloads.
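As a rough illustration of the kind of trace analysis the abstract describes, the sketch below tallies reads versus writes in a memory reference trace and reports how concentrated the writes are. The trace format, the 4 KB page granularity, and the "share of writes hitting the hottest 10% of pages" skew measure are assumptions for illustration, not the authors' methodology.

```cpp
// Minimal sketch (not the authors' tool): count read vs. write references and
// estimate write skewness from a trace of "R <addr>" / "W <addr>" lines (hex addresses).
#include <algorithm>
#include <cstdint>
#include <fstream>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

int main(int argc, char** argv) {
    if (argc < 2) { std::cerr << "usage: trace_stats <trace file>\n"; return 1; }
    std::ifstream in(argv[1]);
    std::unordered_map<uint64_t, uint64_t> write_counts;  // writes per 4 KB page
    uint64_t reads = 0, writes = 0;
    std::string op; uint64_t addr;
    while (in >> op >> std::hex >> addr >> std::dec) {
        if (op == "W") { ++writes; ++write_counts[addr >> 12]; }
        else           { ++reads; }
    }
    std::vector<uint64_t> counts;
    for (const auto& kv : write_counts) counts.push_back(kv.second);
    std::sort(counts.rbegin(), counts.rend());             // hottest pages first
    uint64_t hot = 0;
    size_t top = std::max<size_t>(1, counts.size() / 10);  // hottest 10% of pages
    for (size_t i = 0; i < top && i < counts.size(); ++i) hot += counts[i];
    std::cout << "reads: " << reads << "  writes: " << writes << "\n"
              << "write skew (share of writes to top 10% pages): "
              << (writes ? double(hot) / writes : 0.0) << "\n";
}
```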

Scalar First Replacement Strategy for Reference Prediction Table Used in Prefetching Streaming Data (스트리밍 데이터의 선인출에 사용되는 참조예측표의 스칼라 우선 교체 전략)

  • Lim, Chul-hoo;Chon, Young-Suk;Kim, Suk-il;Jeon, Joong-nam
    • The KIPS Transactions:PartA
    • /
    • v.11A no.3
    • /
    • pp.163-172
    • /
    • 2004
  • Multimedia applications tend to access their data in a streaming pattern with regular intervals. This characteristic can be exploited to prefetch multimedia data into the cache memory and thereby reduce execution time. The reference-prediction prefetch algorithm predicts the memory address likely to be used next, based on the history of memory references stored in the reference prediction table. This paper proposes a strategy for managing the reference prediction table, which holds the data reference instructions for both scalar and streaming data. We observe that scalar reference instructions do not contribute to the data prefetching algorithm. Therefore, when replacing an element in the reference prediction table, the proposed algorithm preferentially selects a scalar reference instruction before a stream reference instruction. This keeps stream reference instructions resident longer than the FIFO replacement policy does, and ultimately improves the performance of data prefetching.
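As a rough sketch of the scalar-first idea, the snippet below classifies a reference prediction table (RPT) entry as a stream reference once it repeats the same non-zero stride, and on overflow evicts the oldest scalar entry before touching any stream entry, falling back to plain FIFO. The entry fields and the classification rule are illustrative assumptions, not the paper's exact table layout.

```cpp
// Scalar-first replacement sketch for a reference prediction table (assumed layout).
#include <cstdint>
#include <deque>

struct RptEntry {
    uint64_t pc;         // address of the memory reference instruction
    uint64_t last_addr;  // last data address it touched
    int64_t  stride;     // last observed stride
    bool     is_stream;  // stride repeated => streaming reference
};

class Rpt {
public:
    explicit Rpt(size_t capacity) : cap_(capacity) {}

    void access(uint64_t pc, uint64_t addr) {
        for (auto& e : table_) {
            if (e.pc == pc) {
                int64_t s = int64_t(addr) - int64_t(e.last_addr);
                e.is_stream = (s != 0 && s == e.stride);   // same non-zero stride twice
                e.stride = s;
                e.last_addr = addr;
                return;
            }
        }
        if (table_.size() == cap_) evict();
        table_.push_back({pc, addr, 0, false});            // FIFO insertion order
    }

private:
    void evict() {
        // Prefer the oldest scalar entry; evict a stream entry only if no
        // scalar entry exists (the scalar-first policy).
        for (auto it = table_.begin(); it != table_.end(); ++it) {
            if (!it->is_stream) { table_.erase(it); return; }
        }
        table_.pop_front();                                // plain FIFO fallback
    }

    size_t cap_;
    std::deque<RptEntry> table_;
};
```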

Design of a DI model-based Content Addressable Memory for Asynchronous Cache

  • Battogtokh, Jigjidsuren;Cho, Kyoung-Rok
    • International Journal of Contents
    • /
    • v.5 no.2
    • /
    • pp.53-58
    • /
    • 2009
  • This paper presents a novel approach to the design of a CAM for an asynchronous cache. The cache architecture mainly consists of four units: control logic, content addressable memory, completion signal logic, and instruction memory. Pseudo-DCVSL is used to generate a completion signal that serves as a reference for handshake control. The proposed CAM is a simple extension of the basic circuitry that generates a completion signal based on the DI (delay-insensitive) model. The cache has a 2.75 KB CAM for an 8 KB instruction memory. We designed and simulated the proposed asynchronous cache, including the CAM. The results show a cache hit ratio of up to 95% with a pseudo-LRU replacement policy.
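For background only, the sketch below shows the tree pseudo-LRU bookkeeping for a 4-way set in software form, since the reported hit ratio is measured under a pseudo-LRU replacement policy. The paper itself is an asynchronous CAM/DCVSL hardware design, so this is not its circuitry, just an assumed software rendering of the policy it simulates with.

```cpp
// Tree pseudo-LRU for one 4-way set (illustrative, not the paper's hardware).
class PseudoLru4 {
public:
    // Way to evict: follow the bits toward the less-recently-used side.
    // Convention: b0 == false -> left pair (ways 0/1) is the LRU side;
    // b1/b2 pick the LRU way inside the left/right pair respectively.
    int victim() const {
        if (!b0_) return b1_ ? 1 : 0;
        return b2_ ? 3 : 2;
    }

    // On an access, flip the bits along the path so they point AWAY from `way`.
    void touch(int way) {
        switch (way) {
            case 0: b0_ = true;  b1_ = true;  break;
            case 1: b0_ = true;  b1_ = false; break;
            case 2: b0_ = false; b2_ = true;  break;
            case 3: b0_ = false; b2_ = false; break;
        }
    }

private:
    bool b0_ = false, b1_ = false, b2_ = false;  // 3 pseudo-LRU bits per set
};
```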

Low-Power IoT Microcontroller Code Memory Interface using Binary Code Inversion Technique Based on Hot-Spot Access Region Detection (핫스팟 접근영역 인식에 기반한 바이너리 코드 역전 기법을 사용한 저전력 IoT MCU 코드 메모리 인터페이스 구조 연구)

  • Park, Daejin
    • IEMEK Journal of Embedded Systems and Applications
    • /
    • v.11 no.2
    • /
    • pp.97-105
    • /
    • 2016
  • Microcontrollers (MCUs) for endpoint smart sensor devices of internet-of-thing (IoT) are being implemented as system-on-chip (SoC) with on-chip instruction flash memory, in which user firmware is embedded. MCUs directly fetch binary code-based instructions through bit-line sense amplifier (S/A) integrated with on-chip flash memory. The S/A compares bit cell current with reference current to identify which data are programmed. The S/A in reading '0' (erased) cell data consumes a large sink current, which is greater than off-current for '1' (programmed) cell data. The main motivation of our approach is to reduce the number of accesses of erased cells by binary code level transformation. This paper proposes a built-in write/read path architecture using binary code inversion method based on hot-spot region detection of instruction code access to reduce sensing current in S/A. From the profiling result of instruction access patterns, hot-spot region of an original compiled binary code is conditionally inverted with the proposed bit-inversion techniques. The de-inversion hardware only consumes small logic current instead of analog sink current in S/A and it is integrated with the conventional S/A to restore original binary instructions. The proposed techniques are applied to the fully-custom designed MCU with ARM Cortex-M0$^{TM}$ using 0.18um Magnachip Flash-embedded CMOS process and the benefits in terms of power consumption reduction are evaluated for Dhrystone$^{TM}$ benchmark. The profiling environment of instruction code executions is implemented by extending commercial ARM KEIL$^{TM}$ MDK (MCU Development Kit) with our custom-designed access analyzer.

A Hardware Cache Prefetching Scheme for Multimedia Data with Intermittently Irregular Strides (단속적(斷續的) 불규칙 주소간격을 갖는 멀티미디어 데이타를 위한 하드웨어 캐시 선인출 방법)

  • Chon Young-Suk;Moon Hyun-Ju;Jeon Joongnam;Kim Sukil
    • Journal of KIISE:Computer Systems and Theory
    • /
    • v.31 no.11
    • /
    • pp.658-672
    • /
    • 2004
  • Multimedia applications must process huge amounts of data at high speed in real time. Memory reference instructions such as loads and stores are the main factor limiting high-speed processor execution. To improve memory reference speed, cache prefetching schemes reduce the cache miss ratio and the total execution time by fetching data expected to be referenced in the future into the cache in advance. In this study, we present an advanced data cache prefetching scheme that improves on the conventional RPT (reference prediction table) based scheme. The scheme considers the cache line size when calculating the address stride referenced by the same instruction, and enhances the prefetching algorithm so that the benefit of prefetching is maintained even when an irregular address stride is inserted into a series of uniform strides. In experiments on multimedia benchmark programs, the cache miss ratio improved by 29% on average compared to the conventional RPT scheme, while bus usage increased by only a small amount (0.03%).
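The sketch below illustrates, under stated assumptions, a per-instruction stride predictor that tolerates an intermittently irregular stride: one mismatch only lowers a confidence counter rather than discarding the learned stride, and strides are computed at cache-line granularity. The state machine, field names, and 64-byte line size are illustrative, not the paper's exact RPT extension.

```cpp
// Stride predictor sketch that survives a single irregular access (assumed design).
#include <cstdint>
#include <optional>

constexpr uint64_t kLineSize = 64;   // assumed cache line size in bytes

struct StrideEntry {
    uint64_t last_addr  = 0;
    int64_t  stride     = 0;         // learned stride, in cache lines
    int      confidence = 0;         // 0..2; one irregular access only decrements
};

// Returns the byte address of the line to prefetch, if any, for this access.
std::optional<uint64_t> on_access(StrideEntry& e, uint64_t addr) {
    int64_t s = int64_t(addr / kLineSize) - int64_t(e.last_addr / kLineSize);
    e.last_addr = addr;
    if (s == e.stride && s != 0) {
        if (e.confidence < 2) ++e.confidence;
    } else {
        // An intermittently irregular stride costs one confidence point but
        // does not erase the learned stride; repeated mismatches retrain it.
        if (e.confidence > 0) --e.confidence;
        else e.stride = s;
    }
    if (e.confidence >= 1 && e.stride != 0)
        return (addr / kLineSize + e.stride) * kLineSize;  // next line to prefetch
    return std::nullopt;
}
```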

Compiler Optimization Techniques for The Next Generation Low Power Multibank Memory (차세대 저전력 멀티뱅크 메모리를 위한 컴파일러 최적화 기법)

  • Cho, Doosan
    • The Journal of the Institute of Internet, Broadcasting and Communication
    • /
    • v.21 no.6
    • /
    • pp.141-145
    • /
    • 2021
  • Various types of memory architectures have been developed, and various compiler optimization techniques have been studied to use them efficiently. In particular, since memory is a major component determining performance in mobile computing devices, many optimization techniques have been developed to support it. Recently, much research has been conducted on hybrid memory architectures, and various compiler techniques are being studied to support them. Existing compiler optimization techniques can be used to achieve the minimum required performance and low-power constraints set by market requirements. However, reference data for judging the low-power effect and the degree of performance improvement of these optimization techniques are not yet properly available. This study provides experimental results for existing compiler techniques as a reference for the development of multibank memory architectures.

Design of Virtual Machine for Vertex Shader (정점 셰이더의 가상 기계 구현)

  • Ha, Chang-Soo;Kim, Ju-Hong;Choi, Byeong-Yoon
    • Proceedings of the IEEK Conference
    • /
    • 2005.11a
    • /
    • pp.1003-1006
    • /
    • 2005
  • The vertex shader of a personal computer GPU has advanced to take over the vertex-processing half of the traditional fixed T&L pipeline, and the memory capacity for storing the resources needed to process instructions is unlimited. A programmer-programmable GPU is needed for mobile systems as well as personal computers. In this paper, we implement a software virtual machine for the vertex shader in C++. Our goal is to design a hardware GPU that can be applied to mobile systems. The virtual machine's instruction set consists of nVidia GPU instructions. Input data to the virtual machine is generated by the Microsoft fxc compiler; that is, the input is a compiled shader program written in HLSL, Cg, or ASM. The virtual machine will serve as a reference model for designing a hardware GPU and can be used as a testbed for added or modified instructions.
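As an illustrative sketch (not the paper's implementation), the core of such a vertex shader virtual machine is a decode-and-execute loop over a register file of 4-component vectors; the opcode subset and register layout below are assumptions.

```cpp
// Tiny vertex-shader VM core: executes a list of decoded instructions over Vec4 registers.
#include <array>
#include <vector>

using Vec4 = std::array<float, 4>;

enum class Op { MOV, ADD, MUL, DP4 };          // assumed subset of shader opcodes

struct Instr { Op op; int dst, src0, src1; };  // register-indexed operands

struct VertexVm {
    std::array<Vec4, 16> regs{};               // temporary/input/output registers

    void run(const std::vector<Instr>& program) {
        for (const Instr& i : program) {
            Vec4& d = regs[i.dst];
            const Vec4& a = regs[i.src0];
            const Vec4& b = regs[i.src1];
            switch (i.op) {
                case Op::MOV: d = a; break;
                case Op::ADD: for (int k = 0; k < 4; ++k) d[k] = a[k] + b[k]; break;
                case Op::MUL: for (int k = 0; k < 4; ++k) d[k] = a[k] * b[k]; break;
                case Op::DP4: {                // 4-component dot product
                    float s = 0.f;
                    for (int k = 0; k < 4; ++k) s += a[k] * b[k];
                    d = {s, s, s, s};
                    break;
                }
            }
        }
    }
};
```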


DEVELOPMENT OF CFD PROGRAM FOR THE CONJUGATE HEAT TRANSFER ANALYSIS OF PMSM ELECTRIC MOTOR (PMSM 전동기 모터의 복합 열전달 해석을 위한 CFD 프로그램 개발)

  • Lee, Jung-Hee;Choi, Jong-Rak;Hur, Nahm-Keon;Kim, Joo-Han;Kim, Young-Kyoun
    • Proceedings of the Korean Society of Computational Fluids Engineering Conference
    • /
    • 2011.05a
    • /
    • pp.488-493
    • /
    • 2011
  • The objective of this study is to develop a program for analyzing the fluid flow and heat transfer of a PMSM electric motor. The program is mainly intended for users inexperienced in CFD analysis, so it must run using only the geometry data and the heat source of each part. An interface program is developed to convert the given data into instructions for the pre-processor. Conjugate heat transfer between the flow passage of the motor and the inner parts, consisting of the rotor and stator, is considered. To reduce computational time and memory storage, a cyclic boundary condition is applied. For the numerical simulation, the MRF (Multi-Reference Frame) method is used to account for the rotating operation of the rotor, and heat sources are applied to the copper, wire, and magnetic parts of the motor. On screen, users can display the velocity distributions and contours of pressure, turbulent kinetic energy, turbulent dissipation rate, and temperature.


An Energy Efficient and High Performance Data Cache Structure Utilizing Tag History of Cache Addresses (캐시 주소의 태그 이력을 활용한 에너지 효율적 고성능 데이터 캐시 구조)

  • Moon, Hyun-Ju;Jee, Sung-Hyun
    • The KIPS Transactions:PartA
    • /
    • v.14A no.1 s.105
    • /
    • pp.55-62
    • /
    • 2007
  • The uptime of embedded processors for mobile devices depends on battery consumption. In particular, a large portion of the power consumption of embedded processors is known to come from cache management. This paper proposes an energy-efficient data cache structure for high-performance embedded processors. A high-performance prefetching data cache issues prefetch instructions ahead of demand-fetch instructions based on reference predictions. These prefetches reduce memory delay by improving the cache hit ratio, but they also increase energy consumption in proportion to the number of prefetch instructions issued. In this paper, we add a tag history table to the prefetching data cache to reduce energy consumption by minimizing parallel tag comparisons. Experimental results show that the proposed data cache improves both energy consumption and memory delay.
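A minimal sketch of the tag-history idea, under stated assumptions: remember which way last matched in each set, compare that single tag first, and fall back to the full parallel comparison only on a mismatch. The table shape and the single-entry history per set are illustrative, not the paper's exact structure; the two counters give a rough proxy for the comparison energy saved.

```cpp
// Tag-history lookup sketch for a 4-way set-associative cache (assumed organization).
#include <cstdint>
#include <vector>

struct CacheSim {
    static constexpr int kWays = 4;
    std::vector<std::vector<uint64_t>> tags;   // tags[set][way]
    std::vector<int> last_hit_way;             // tag history: one entry per set

    explicit CacheSim(size_t sets)
        : tags(sets, std::vector<uint64_t>(kWays, ~0ull)), last_hit_way(sets, 0) {}

    // Returns the matching way, or -1 on a miss. The counters approximate how
    // often the history avoids a full parallel tag comparison.
    int lookup(size_t set, uint64_t tag, long& single_compares, long& parallel_compares) {
        int hinted = last_hit_way[set];
        ++single_compares;
        if (tags[set][hinted] == tag) return hinted;        // history hit: one comparison
        ++parallel_compares;                                 // full parallel lookup needed
        for (int w = 0; w < kWays; ++w) {
            if (w != hinted && tags[set][w] == tag) { last_hit_way[set] = w; return w; }
        }
        return -1;
    }
};
```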

The Design and Implementation of a Parallel Processing System Using the Nios® II Embedded Processor (Nios® II 임베디드 프로세서를 사용한 병렬처리 시스템의 설계 및 구현)

  • Lee, Si-Hyun
    • Journal of the Korea Society of Computer and Information
    • /
    • v.14 no.11
    • /
    • pp.97-103
    • /
    • 2009
  • In this paper, we discuss the implementation of a parallel processing system that achieves a high degree of efficiency (in size, cost, performance, and flexibility) by using the Nios® II embedded processor, a 32-bit RISC (Reduced Instruction Set Computer) processor, on a DE2-70 reference board. The designed parallel processing system is a master-slave, shared-memory, MIMD (Multiple Instruction, Multiple Data stream) architecture with four processors. An N-point FFT is used for the performance test of the system. The results show the following speed-up: with two processors (cores), the average speed-up is 1.8 times that of a single processor; with four processors, the average speed-up is 2.4 times.
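As a quick check of the reported numbers (a sketch, not the authors' code): parallel efficiency is speedup divided by processor count, so 1.8x on 2 cores and 2.4x on 4 cores correspond to roughly 90% and 60% efficiency.

```cpp
// Recompute speedup and efficiency from the figures quoted in the abstract.
#include <cstdio>

int main() {
    const struct { int cores; double speedup; } results[] = {{2, 1.8}, {4, 2.4}};
    for (const auto& r : results)
        std::printf("%d cores: speedup %.1fx, efficiency %.0f%%\n",
                    r.cores, r.speedup, 100.0 * r.speedup / r.cores);
}
```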