• Title/Summary/Keyword: multiply and accumulate

Search Result 23, Processing Time 0.024 seconds

Design of a RISC Processor with an Efficient Processing Unit for Multimedia Data (효율적인 멀티미디어데이터 처리를 위한 RISC Processor의 설계)

  • 조태헌;남기훈;김명환;이광엽
    • Proceedings of the IEEK Conference
    • /
    • 2003.07b
    • /
    • pp.867-870
    • /
    • 2003
  • 본 논문은 멀티미디어 데이터 처리를 위한 효율적인 RISC 프로세서 유닛의 설계를 목표로 Vector 프로세서의 SIMD(Single Instruction Multiple Data) 개념을 바탕으로 고정된 연산기 데이터 비트 수에 비해 상대적으로 작은 비트수의 데이터 연산의 부분 병렬화를 통하여 멀티미디어 데이터 연산의 기본이 되는 곱셈누적(MAC : Multiply and Accumulate) 연산의 성능을 향상 시킨다. 또한 기존의 MMX나 VIS 등과 같은 범용 프로세서들의 부분 병렬화를 위해 전 처리 과정의 필요충분조건인 데이터의 연속성을 위해 서로 다른 길이의 데이터 흑은 비트 수가 작은 멀티미디어의 데이터를 하나의 데이터로 재처리 하는 재정렬 혹은 Packing/Unpacking 과정이 성능 전체적인 성능 저하에 작용하게 되므로 본 논문에서는 기존의 프로세서의 연산기 구조를 재이용하여 병렬 곱셈을 위한 연산기 구조를 구현하고 이를 위한 데이터 정렬 연산 구조를 제안한다.

  • PDF

Low Power ADC Design for Mixed Signal Convolutional Neural Network Accelerator (혼성신호 컨볼루션 뉴럴 네트워크 가속기를 위한 저전력 ADC설계)

  • Lee, Jung Yeon;Asghar, Malik Summair;Arslan, Saad;Kim, HyungWon
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.25 no.11
    • /
    • pp.1627-1634
    • /
    • 2021
  • This paper introduces a low-power compact ADC circuit for analog Convolutional filter for low-power neural network accelerator SOC. While convolutional neural network accelerators can speed up the learning and inference process, they have drawback of consuming excessive power and occupying large chip area due to large number of multiply-and-accumulate operators when implemented in complex digital circuits. To overcome these drawbacks, we implemented an analog convolutional filter that consists of an analog multiply-and-accumulate arithmetic circuit along with an ADC. This paper is focused on the design optimization of a low-power 8bit SAR ADC for the analog convolutional filter accelerator We demonstrate how to minimize the capacitor-array DAC, an important component of SAR ADC, which is three times smaller than the conventional circuit. The proposed ADC has been fabricated in CMOS 65nm process. It achieves an overall size of 1355.7㎛2, power consumption of 2.6㎼ at a frequency of 100MHz, SNDR of 44.19 dB, and ENOB of 7.04bit.

Cascade CNN with CPU-FPGA Architecture for Real-time Face Detection (실시간 얼굴 검출을 위한 Cascade CNN의 CPU-FPGA 구조 연구)

  • Nam, Kwang-Min;Jeong, Yong-Jin
    • Journal of IKEEE
    • /
    • v.21 no.4
    • /
    • pp.388-396
    • /
    • 2017
  • Since there are many variables such as various poses, illuminations and occlusions in a face detection problem, a high performance detection system is required. Although CNN is excellent in image classification, CNN operatioin requires high-performance hardware resources. But low cost low power environments are essential for small and mobile systems. So in this paper, the CPU-FPGA integrated system is designed based on 3-stage cascade CNN architecture using small size FPGA. Adaptive Region of Interest (ROI) is applied to reduce the number of CNN operations using face information of the previous frame. We use a Field Programmable Gate Array(FPGA) to accelerate the CNN computations. The accelerator reads multiple featuremap at once on the FPGA and performs a Multiply-Accumulate (MAC) operation in parallel for convolution operation. The system is implemented on Altera Cyclone V FPGA in which ARM Cortex A-9 and on-chip SRAM are embedded. The system runs at 30FPS with HD resolution input images. The CPU-FPGA integrated system showed 8.5 times of the power efficiency compared to systems using CPU only.

Design of New DSP Instructions and Their Hardware Architecture for High-Speed FFT (고속 FFT 연산을 위한 새로운 DSP 명령어 및 하드웨어 구조 설계)

  • Lee, Jae-Sung;Sunwoo, Myung-Hoon
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.39 no.11
    • /
    • pp.62-71
    • /
    • 2002
  • This paper presents new DSP (Digital Signal Processor) instructions and their hardware architecture for high-speed FFT. the instructions perform new operation flows, which are different from the MAC (Multiply and Accumulate) operation on which existing DSP chips heavily depend. The proposed DPU (Data Processing Unit) supporting the instructions shows two times faster than existing DSP chips for FFT. The architecture has been modeled by the Verilog HDL and logic synthesis has been performed using the 0.35 ${\mu}m$ standard cell library. The maximum operating clock frequency is about 144.5 MHz.

SRAM-Based Area-Efficient Computing-in-Memory for AI Edge Devices (AI 엣지 디바이스를 위한 SRAM 기반 면적 효율적인 컴퓨팅 인 메모리)

  • Hyun-Ki Hong;Sung-Hun Jo
    • The Journal of the Korea institute of electronic communication sciences
    • /
    • v.19 no.5
    • /
    • pp.1051-1058
    • /
    • 2024
  • In semiconductors for AI, Computing in Memory (CIM) integrates computation and memory to minimize data movement and reduce processing bottlenecks, thereby improving performance. In AI tasks that handle large amounts of data, CIM is gaining attention as a key technology that optimizes the performance of AI systems by improving power efficiency and enabling faster computation. In this paper, a new CIM architecture for AI semiconductors is proposed. The proposed architecture can perform MAC operations by controlling the width of the transistor and the pulse width of the control signal, and can be implemented in a smaller area than the existing architecture.

AE32000B: a Fully Synthesizable 32-Bit Embedded Microprocessor Core

  • Kim, Hyun-Gyu;Jung, Dae-Young;Jung, Hyun-Sup;Choi, Young-Min;Han, Jung-Su;Min, Byung-Gueon;Oh, Hyeong-Cheol
    • ETRI Journal
    • /
    • v.25 no.5
    • /
    • pp.337-344
    • /
    • 2003
  • In this paper, we introduce a fully synthesizable 32-bit embedded microprocessor core called the AE32000B. The AE32000B core is based on the extendable instruction set computer architecture, so it has high code density and a low memory access rate. In order to improve the performance of the core, we developed and adopted various design options, including the load extension register instruction (LERI) folding unit, a high performance multiply and accumulate (MAC) unit, various DSP units, and an efficient coprocessor interface. The instructions per cycle count of the Dhrystone 2.1 benchmark for the designed core is about 0.86. We verified the synthesizability and the area and time performances of our design using two CMOS standard cell libraries: a 0.35-${\mu}m$ library and a 0.18-${\mu}m$ library. With the 0.35-${\mu}m$ library, the core can be synthesized with about 47,000 gates and operate at 70 MHz or higher, while it can be synthesized with about 53,000 gates and operate at 120 MHz or higher with the 0.18-${\mu}m$ library.

  • PDF

IEEE-754 Floating-Point Divider for Embedded Processors (내장형 프로세서를 위한 IEEE-754 고성능 부동소수점 나눗셈기의 설계)

  • Jeong, Jae-Won;Hong, In-Pyo;Jeong, Woo-Kyong;Lee, Yong-Surk
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.39 no.7
    • /
    • pp.66-73
    • /
    • 2002
  • As floating-point operations become widely used in various applications such as computer graphics and high-definition DSP, the needs for fast division become increased. However, conventional floating-point dividers occupy a large hardware area, and bring bottle-becks to the entire floating-point operations. In this paper, a high-performance and small-area floating-point divider, which is suitable for embedded processors, is designed using he series expansion algorithm. The algorithm is selected to utilize two MAC(Multiply-ACcumulate) units for quadratic convergence to the correct quotient. The two MAC units for SIMD-DSP features are shared and the additional area for the division only is very small. The proposed divider supports all rounding modes defined by IEEE 754 standard, and error estimations are performed for appropriate precision.

Design of Bit Manipulation Accelerator fo Communication DSP (통신용 DSP를 위한 비트 조작 연산 가속기의 설계)

  • Jeong Sug H.;Sunwoo Myung H.
    • Journal of the Institute of Electronics Engineers of Korea TC
    • /
    • v.42 no.8 s.338
    • /
    • pp.11-16
    • /
    • 2005
  • This paper proposes a bit manipulation accelerator (BMA) having application specific instructions, which efficiently supports scrambling, convolutional encoding, puncturing, and interleaving. Conventional DSPs cannot effectively perform bit manipulation functions since かey have multiply accumulate (MAC) oriented data paths and word-based functions. However, the proposed accelerator can efficiently process bit manipulation functions using parallel shift and Exclusive-OR (XOR) operations and bit jnsertion/extraction operations on multiple data. The proposed BMA has been modeled by VHDL and synthesized using the SEC $0.18\mu m$ standard cell library and the gate count of the BMA is only about 1,700 gates. Performance comparisons show that the number of clock cycles can be reduced about $40\%\sim80\%$ for scrambling, convolutional encoding and interleaving compared with existing DSPs.

A Performance Evaluation of a RISC-Based Digital Signal Processor Architecture (RISC 기반 DSP 프로세서 아키텍쳐의 성능 평가)

  • Kang, Ji-Yang;Lee, Jong-Bok;Sung, Won-Yong
    • Journal of the Korean Institute of Telematics and Electronics C
    • /
    • v.36C no.2
    • /
    • pp.1-13
    • /
    • 1999
  • As the complexity of DSP (Digital Signal Processing) applications increases, the need for new architectures supporting efficient high-level language compilers also grows. By combining several DSP processor specific features, such as single cycle MAC (Multiply-and-ACcumulate), direct memory access, automatic address generation, and hardware looping, with a RISC core having many general purpose registers and orthogonal instructions, a high-performance and compiler-friendly RISC-based DSP processors can be designed. In this study, we develop a code-converter that can exploit these DSP architectural features by post-processing compiler-generated assembly code, and evaluate the performance effects of each feature using seven DSP-kernel benchmarks and a QCELP vocoder program. Finally, we also compare the performances with several existing DSP processors, such as TMS320C3x, TMS320C54x, and TMS320C5x.

  • PDF

Performance Comparison of DCT Algorithm Implementations Based on Hardware Architecture (프로세서 구조에 따른 DCT 알고리즘의 구현 성능 비교)

  • Lee Jae-Seong;Pack Young-Cheol;Youn Dae-Hee
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.31 no.6C
    • /
    • pp.637-644
    • /
    • 2006
  • This paper presents performance and implementation comparisons of standard and fast DCT algorithms that are commonly used for subband filter bank in MPEG audio coders. The comparison is made according to the architectural difference of the implementation hardware. Fast DCT algorithms are known to have much less computational complexity than the standard method that involves computing a vector dot product of cosine coefficient. But, due to structural irregularity, fast DCT algorithms require extra cycles to generate the addresses for operands and to realign interim data. When algorithms are implemented using DSP processors that provide special operations such as single-cycle MAC (multiply-accumulate), zero-overhead nested loop, the standard algorithm is more advantageous than the fast algorithms. Also, in case of the finite-precision processing, the error performance of the standard method is far superior to that of the fast algorithms. In this paper, truncation errors and algorithmic suitability are analyzed and implementation results are provided to support the analysis.