• Title/Summary/Keyword: instruction set architecture

Search Result 88, Processing Time 0.026 seconds

Design of an Asynchronous Instruction Cache based on a Mixed Delay Model (혼합 지연 모델에 기반한 비동기 명령어 캐시 설계)

  • Jeon, Kwang-Bae;Kim, Seok-Man;Lee, Je-Hoon;Oh, Myeong-Hoon;Cho, Kyoung-Rok
    • The Journal of the Korea Contents Association
    • /
    • v.10 no.3
    • /
    • pp.64-71
    • /
    • 2010
  • Recently, to achieve high performance of the processor, the cache is splits physically into two parts, one for instruction and one for data. This paper proposes an architecture of asynchronous instruction cache based on mixed-delay model that are DI(delay-insensitive) model for cache hit and Bundled delay model for cache miss. We synthesized the instruction cache at gate-level and constructed a test platform with 32-bit embedded processor EISC to evaluate performance. The cache communicates with the main memory and CPU using 4-phase hand-shake protocol. It has a 8-KB, 4-way set associative memory that employs Pseudo-LRU replacement algorithm. As the results, the designed cache shows 99% cache hit ratio and reduced latency to 68% tested on the platform with MI bench mark programs.

Dual-mode Pseudorandom Number Generator Extension for Embedded System (임베디드 시스템에 적합한 듀얼 모드 의사 난수 생성 확장 모듈의 설계)

  • Lee, Suk-Han;Hur, Won;Lee, Yong-Surk
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.46 no.8
    • /
    • pp.95-101
    • /
    • 2009
  • Random numbers are used in many sorts of applications. Some applications, like simple software simulation tests, communication protocol verifications, cryptography verification and so forth, need various levels of randomness with various process speeds. In this paper, we propose a fast pseudorandom generator module for embedded systems. The generator module is implemented in hardware which can run in two modes, one of which can generate random numbers with higher randomness but which requires six cycles, the other providing its result within one cycle but with less randomness. An ASIP (Application Specific Instruction set Processor) was designed to implement the proposed pseudorandom generator instruction sets. We designed a processor based on the MIPS architecture,, by using LISA, and have run statistical tests passing the sequence of the Diehard test suite. The HDL models of the processor were generated using CoWare's Processor Designer and synthesized into the Dong-bu 0.18um CMOS cell library using the Synopsys Design Compiler. With the proposed pseudorandom generator module, random number generation performance was 239% faster than software model, but the area increased only 2.0% of the proposed ASIP.

Regular Expression Matching Processor Architecture Supporting Character Class Matching (문자클래스 매칭을 지원하는 정규표현식 매칭 프로세서 구조)

  • Yun, SangKyun
    • Journal of KIISE
    • /
    • v.42 no.10
    • /
    • pp.1280-1285
    • /
    • 2015
  • Many hardware-based regular expression matching architectures are proposed for high performance matching. In particular, regular expression processors such as ReCPU and SMPU perform pattern matching in a similar approach to that used in general purpose processors, which provide the flexibility when updating patterns. However, these processors are inefficient in performing class matching since they do not provide character class matching capabilities. This paper proposes an instruction set and architecture of a regular expression matching processor, which can support character class matching. The proposed processor can efficiently perform character class matching since it includes character class, character range, and negated character class matching capabilities.

A Parallel-Architecture Processor Design for the Fast Multiplication of Homogeneous Transformation Matrices (Homogeneous Transformation Matrix의 곱셈을 위한 병렬구조 프로세서의 설계)

  • Kwon Do-All;Chung Tae-Sang
    • The Transactions of the Korean Institute of Electrical Engineers D
    • /
    • v.54 no.12
    • /
    • pp.723-731
    • /
    • 2005
  • The $4{\times}4$ homogeneous transformation matrix is a compact representation of orientation and position of an object in robotics and computer graphics. A coordinate transformation is accomplished through the successive multiplications of homogeneous matrices, each of which represents the orientation and position of each corresponding link. Thus, for real time control applications in robotics or animation in computer graphics, the fast multiplication of homogeneous matrices is quite demanding. In this paper, a parallel-architecture vector processor is designed for this purpose. The processor has several key features. For the accuracy of computation for real application, the operands of the processors are floating point numbers based on the IEEE Standard 754. For the parallelism and reduction of hardware redundancy, the processor takes column vectors of homogeneous matrices as multiplication unit. To further improve the throughput, the processor structure and its control is based on a pipe-lined structure. Since the designed processor can be used as a special purpose coprocessor in robotics and computer graphics, additionally to special matrix/matrix or matrix/vector multiplication, several other useful instructions for various transformation algorithms are included for wide application of the new design. The suggested instruction set will serve as standard in future processor design for Robotics and Computer Graphics. The design is verified using FPGA implementation. Also a comparative performance improvement of the proposed design is studied compared to a uni-processor approach for possibilities of its real time application.

An Architecture of Vector Processor Concept using Dimensional Counting Mechanism of Structured Data (구조성 데이터의 입체식 계수기법에 의한 벡터 처리개념의 설계)

  • Jo, Yeong-Il;Park, Jang-Chun
    • The Transactions of the Korea Information Processing Society
    • /
    • v.3 no.1
    • /
    • pp.167-180
    • /
    • 1996
  • In the scalar processing oriented machine scalar operations must be performed for the vector processing as many as the number of vector components. So called a vector processing mechanism by the von Neumann operational principle. Accessing vector data hasto beperformed by theevery pointing ofthe instruction or by the address calculation of the ALU, because there is only a program counter(PC) for the sequential counting of the instructions as a memory accessing device. It should be here proposed that an access unit dimensionally to address components has to be designed for the compensation of the organizational hardware defect of the conventional concept. The necessity for the vector structuring has to be implemented in the instruction set and be performed in the mid of the accessing data memory overlapped externally to the data processing unit at the same time.

  • PDF

Design of a Dingle-chip Multiprocessor with On-chip Learning for Large Scale Neural Network Simulation (대규모 신경망 시뮬레이션을 위한 칩상 학습가능한 단일칩 다중 프로세서의 구현)

  • 김종문;송윤선;김명원
    • Journal of the Korean Institute of Telematics and Electronics B
    • /
    • v.33B no.2
    • /
    • pp.149-158
    • /
    • 1996
  • In this paper we describe designing and implementing a digital neural chip and a parallel neural machine for simulating large scale neural netsorks. The chip is a single-chip multiprocessor which has four digiral neural processors (DNP-II) of the same architecture. Each DNP-II has program memory and data memory, and the chip operates in MIMD (multi-instruction, multi-data) parallel processor. The DNP-II has the instruction set tailored to neural computation. Which can be sed to effectively simulate various neural network models including on-chip learning. The DNP-II facilitates four-way data-driven communication supporting the extensibility of parallel systems. The parallel neural machine consists of a host computer, processor boards, a buffer board and an interface board. Each processor board consists of 8*8 array of DNP-II(equivalently 2*2 neural chips). Each processor board acn be built including linear array, 2-D mesh and 2-D torus. This flexibility supports efficiency of mapping from neural network models into parallel strucgure. The neural system accomplishes the performance of maximum 40 GCPS(giga connection per second) with 16 processor boards.

  • PDF

Design of an Image Processing ASIC Architecture using Parallel Approach with Zero or Little (통신부담을 감소시킨 영상처리를 위한 병렬처리 방식 ASIC구조 설계)

  • 안병덕;정지원;선우명훈
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.19 no.10
    • /
    • pp.2043-2052
    • /
    • 1994
  • This paper proposes a new parallel ASIC architecture for real-time image processing to reduce inter-processing element (inter-PE) communication overhead, called a Sliding Memory Plane (SliM) Image Processor. The Slim Image Processor consists of $3\times3$ processing elements (PEs) connected by a mesh topology. With easy scalability due to the topology. a set of SliM Image Processors can form a mesh-connected SIMD parallel architecture. called the SliM Array Processor. The idea of sliding means that all pixels are slided into all neighboring PEs without interrupting PEs and without a coprocessor or a DMA controller. Since the inter-PE communication and computation occur simultaneously. the inter-PE communication overhead, significant disadvantage of existing machines greatly diminishes. Two I/O planes provide a buffering capability and reduce the date I/O overhead. In addition, using the by-passing path provides eight-way connectivity even with four links. with these salient features. SliM shows a significant performance improvement. This paper presents architectures of a PE and the SliM Image Processor, and describes the design of an instruction set.

  • PDF

Design of a scalable general-purpose parallel associative processor using content-addressable memory (Content-Addressable Memory를 이용한 확장 가능한 범용 병렬 Associative Processor 설계)

  • Park, Tae-Geun
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.43 no.2 s.344
    • /
    • pp.51-59
    • /
    • 2006
  • Von Neumann architecture suffers from the interface between the central processing unit and the memory, which is called 'Von Neumann bottleneck' In this paper, we propose a scalable general-purpose associative processor (AP) based on content-addressable memory (CAM) which solves this problem and is suitable for the search-oriented applications. We propose an efficient instruction set and a structural scalability to extend for larger applications. We define twelve instructions and provide some reduced instructions to speed up which execute two instructions in a single instruction cycle. The proposed AP performs in a bit-serial, word-parallel fashion and can be considered as a 32-bit general-purpose parallel processor with a massively parallel SIMD structure. We design and simulate a maximum/minumum search greater-than/less-than search, and parallel addition to verify the proposed architecture. The algorithms are executed in a constant time O(k) regardless of the number of input data.

Ultra-low-power DSP for Audio Signal Processing (오디오 신호 처리를 위한 초저전력 DSP 프로세서)

  • Kwon, Kiseok;Ahn, Minwook;Jo, Seokhwan;Lee, Yeonbok;Lee, Seungwon;Park, Young-Hwan;Kim, Sukjin;Kim, Do-Hyung;Kim, Jaehyun
    • Proceedings of the Korean Society of Broadcast Engineers Conference
    • /
    • 2014.06a
    • /
    • pp.157-159
    • /
    • 2014
  • In this paper, we introduce SlimSRP, an ultra-low-power digital signal processor (DSP) solution for mobile audio and voice applications. So far, application processors (APs) have taken charge of all the tasks in mobile devices. However, they have suffered from short battery life problems to deal with complex usage scenarios, such as always-on voice trigger with continuous audio playback. From extensive analysis of audio and voice application characteristics, SlimSRP is designed to relive the performance and power burden of APs. It employs three-issue VLIW architecture, and the major low-power and high-performance techniques include: (1) an optimized register-file architecture friendly for constants generation, (2) a powerful instruction set to reduce the number of register file accesses and (3) a unique instruction compression scheme that contributes to saved memory size and reduced cache miss. An implementation of SlimSRP runs at up to 200MHz and the logic occupies 95K NAND2 gates in Samsung 28LPP process. The experimental results demonstrate that a MP3 decoder application with a 128kbps 44.1kHz input can run at 5.1MHz and the logic consumes only 22uW/MHz.

  • PDF

Design and Evaluation of 32-Bit RISC-V Processor Using FPGA (FPGA를 이용한 32-Bit RISC-V 프로세서 설계 및 평가)

  • Jang, Sungyeong;Park, Sangwoo;Kwon, Guyun;Suh, Taeweon
    • KIPS Transactions on Computer and Communication Systems
    • /
    • v.11 no.1
    • /
    • pp.1-8
    • /
    • 2022
  • RISC-V is an open-source instruction set architecture which has a simple base structure and can be extensible depending on the purpose. In this paper, we designed a small and low-power 32-bit RISC-V processor to establish the base for research on RISC-V embedded systems. We designed a 2-stage pipelined processor which supports RISC-V base integer instruction set except for FENCE and EBREAK instructions. The processor also supports privileged ISA for trap handling. It used 1895 LUTs and 1195 flip-flops, and consumed 0.001W on Xilinx Zynq-7000 FPGA when synthesized using Vivado Design Suite. GPIO, UART, and timer peripherals are additionally used to compose the system. We verified the operation of the processor on FPGA with FreeRTOS at 16MHz. We used Dhrystone and Coremark benchmarks to measure the performance of the processor. This study aims to provide a low-power, high-efficiency microprocessor for future extension.