• Title/Summary/Keyword: multiply and accumulate


Design and Implementation of a Power Aware Scalable Pipelined Booth Multiply & Accumulate Unit (소비전력 인지형 곱셈 연산 누적기의 설계 및 구현)

  • Shin, Min-Hyuk;Lee, Han-Ho
    • Proceedings of the IEEK Conference / 2006.06a / pp.573-574 / 2006
  • A low-power, power-aware, scalable pipelined Booth-recoded multiply and accumulate unit (PA-MAC) examines the dynamic range of its input operands and accordingly performs a 16-bit, 8-bit, or 4-bit multiplication and accumulation. The multiplication mode is selected by the dynamic-range detection unit. Although the proposed PA-MAC occupies a larger area than a non-scalable MAC, it proves to be more power efficient overall.
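
The mode selection described in this abstract can be sketched in software. The abstract does not say how the detection unit is built, so this is only an illustration under stated assumptions: operands are signed two's-complement integers, both operands must fit the chosen width, and the names `detect_mode` and `pa_mac` are hypothetical.

```python
def detect_mode(a: int, b: int) -> int:
    """Return the narrowest width (4, 8, or 16 bits) that holds both signed operands.

    Assumption: two's-complement range checks stand in for the hardware
    dynamic-range detection unit, whose details the abstract does not give.
    """
    for width in (4, 8, 16):
        lo = -(1 << (width - 1))
        hi = (1 << (width - 1)) - 1
        if lo <= a <= hi and lo <= b <= hi:
            return width
    raise ValueError("operands exceed the 16-bit signed range")


def pa_mac(acc: int, a: int, b: int) -> tuple[int, int]:
    """Accumulate a*b into acc and report which multiplication mode was chosen."""
    width = detect_mode(a, b)
    return acc + a * b, width
```

For example, `pa_mac(10, 3, 4)` selects the 4-bit mode, while an operand such as 100 forces the 8-bit mode. In hardware, the narrower mode would let the unused part of the multiplier stay idle, which is the plausible source of the power saving.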


SIMD Multiply-accumulate Unit Design for Multimedia Data Processing (멀티미디어 처리에 적합한 SIMD 곱셈누적 연산기의 설계)

  • 홍인표;정재원;정우경;이용석
    • Proceedings of the IEEK Conference / 2000.11b / pp.349-352 / 2000
  • In this paper, a 64-bit SIMD MAC (Multiply-Accumulate) unit is designed. It is composed of two 32-bit MAC units, each of which supports SIMD 16-bit operations. As a result, it can process two 32-bit MAC operations or four 16-bit operations in one cycle. The proposed MAC unit is described in Verilog HDL. After functional verification, the MAC unit is synthesized and optimized with a 0.35 µm standard cell library. The synthesis results show that the MAC unit can operate at a clock frequency of 80 MHz at 85 °C, 3.0 V, worst-case process, and at 125 MHz at 25 °C, 3.3 V, typical-case process. It achieves a performance of 320 Mops and is suitable for embedded DSP processors.
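
A small software model may help picture the lane arrangement this abstract describes. It is a sketch, not the paper's design: it assumes signed lanes packed from the least significant bits upward and unbounded per-lane accumulators, and the helper names are hypothetical. The `peak_mops` helper only checks that the quoted 320 Mops is consistent with four 16-bit lanes at the 80 MHz worst-case clock.

```python
def unpack_lanes(word: int, lane_bits: int, total_bits: int = 64) -> list[int]:
    """Split a packed word into signed lanes, lowest lane first."""
    mask = (1 << lane_bits) - 1
    sign_bit = 1 << (lane_bits - 1)
    lanes = []
    for i in range(total_bits // lane_bits):
        value = (word >> (i * lane_bits)) & mask
        # Reinterpret the lane as a two's-complement signed value.
        lanes.append(value - (1 << lane_bits) if value & sign_bit else value)
    return lanes


def simd_mac(acc: list[int], x: int, y: int, lane_bits: int) -> list[int]:
    """Per-lane multiply-accumulate: acc[i] + x_lane[i] * y_lane[i]."""
    xs = unpack_lanes(x, lane_bits)
    ys = unpack_lanes(y, lane_bits)
    return [a + p * q for a, p, q in zip(acc, xs, ys)]


def peak_mops(clock_mhz: int, lanes: int) -> int:
    """Peak MAC operations per second in millions, one operation per lane per cycle."""
    return clock_mhz * lanes
```

Calling `simd_mac` with `lane_bits=16` models the four-lane mode, and `lane_bits=32` models the two-lane mode on the same 64-bit operands.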


The Design of low-cost SIMD MAC/MAS for Embedded Systems (임베디드 시스템을 위한 저비용 SIMD MAC/MAS 블록 설계)

  • Lee Yong Joo;Jung Jin Woo;Lee Yong Surk
    • The Journal of Korean Institute of Communications and Information Sciences / v.29 no.10C / pp.1460-1468 / 2004
  • In this paper, we develop a low-area, low-cost SIMD MAC/MAS (Single Instruction Multiple Data Multiply and ACcumulate/Multiply And Subtract) unit for the multimedia applications widely used in everyday life. We compare the result with a previously developed larger, higher-performance SIMD MAC/MAS. The paper consists of five parts: an introduction, the design of the SIMD MAC/MAS hardware, its distinguishing features relative to previous work, the synthesis results, and a conclusion. The design reduces the overall hardware size by 32% compared with the 64-bit SIMD MAC/MAS block designed for high performance. We also improve the ISA (Instruction Set Architecture) to suit embedded DSPs (Digital Signal Processors) and shorten the data width from 64 bits to 32 bits for a more optimized implementation.

SIMD MAC Unit Design for Multimedia Data Processing (멀티미디어 데이터 처리에 적합한 SIMD MAC 연산기의 설계)

  • Hong, In-Pyo;Jeong, Woo-Kyong;Jeong Jae-Won;Lee Yong-Surk
    • Journal of the Institute of Electronics Engineers of Korea SD / v.38 no.12 / pp.44-55 / 2001
  • MAC (Multiply and ACcumulate) is the core operation of multimedia data processing. Because MAC units implemented on traditional DSP units or embedded processors have a latency of three cycles and cannot operate on multiple data simultaneously, their performance is seriously limited. Many high-end general-purpose microprocessors include a SIMD MAC unit as a functional unit. However, these high-end MAC units must support a pipelined structure for various operation modes and high clock frequencies, which makes the control logic complex and increases chip area. In this paper, a 64-bit SIMD MAC unit for embedded processors is designed. It has a latency of one clock cycle, which removes the pipeline control logic, and adds only a minimal area overhead for SIMD support to existing Booth multipliers.


Development of a High-performance DSP Coprocessor Architecture (고성능 32-bit DSP 코프로세서의 아키텍쳐 개발)

  • Yun, Seong-Cheol;Kim, Sang-Uk;Bae, Seong-Il;Gang, Seong-Ho;Kim, Yong-Cheon;Jeong, Seung-Jae;Kim, Sang-U;Mun, Sang-Hun
    • Journal of the Institute of Electronics Engineers of Korea SD / v.39 no.2 / pp.72-81 / 2002
  • A new high-performance DSP architecture is proposed that operates as a coprocessor of a 32-bit microcontroller. Because the proposed architecture is a dual MAC (Multiply and Accumulate) DSP architecture, it can efficiently process the many SOP (sum of products) operations used in DSP applications. To efficiently perform other operations, such as pure additions, without restriction, each MAC is composed of a multiplier and an ALU placed in parallel. In addition, it is a 3-way superscalar architecture that can issue 3 instructions at a time. Benchmark results against 3 other dual MAC DSPs show that the proposed DSP has the best performance. Furthermore, the proposed DSP is shown to be more efficient in memory usage, although its performance is comparable in some algorithms such as Viterbi decoding and FFT butterfly.

The efficient implementation of the multi-channel active noise controller using a low-cost microcontroller unit (저가 microcontoller unit을 이용한 효율적인 다채널 능동 소음 제어기 구현)

  • Chung, Ik Joo
    • The Journal of the Acoustical Society of Korea / v.38 no.1 / pp.9-22 / 2019
  • In this paper, we propose a method for the efficient implementation of a multi-channel active noise controller. Since the normalized MFxLMS (Modified Filtered-x Least Mean Square) algorithm for multi-channel active noise control requires a large amount of computation, it has been difficult to implement the algorithm on a low-cost MCU (Microcontroller Unit). We implement the multi-channel active noise controller efficiently by optimizing the software for the features of the MCU. By maximizing the use of single-cycle MAC (Multiply-Accumulate) operations and minimizing move operations on the delay memory, we achieve more than 3 times the performance through computational optimization, and by parallel processing on the auxiliary processor included in the MCU, we obtain more than 4 times the performance. In addition, the need for additional parts is minimized by maximizing the use of the peripherals embedded in the MCU.
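
One common way to avoid moving delay-memory contents on every sample is a circular buffer, where only an index moves and each filter tap is one MAC. The abstract does not state which technique the authors used, so the sketch below is an assumption for illustration only, and it shows a plain FIR filter rather than the full MFxLMS algorithm. The class name `CircularFir` is hypothetical.

```python
class CircularFir:
    """FIR filter whose delay line is a circular buffer (no per-sample data moves)."""

    def __init__(self, coeffs):
        self.coeffs = list(coeffs)
        self.delay = [0.0] * len(self.coeffs)
        self.head = 0  # index of the newest sample

    def step(self, x: float) -> float:
        n = len(self.coeffs)
        # Move the write index instead of shifting every stored sample.
        self.head = (self.head - 1) % n
        self.delay[self.head] = x
        acc = 0.0
        for k in range(n):
            # One multiply-accumulate per tap; delay[head + k] is the sample k steps old.
            acc += self.coeffs[k] * self.delay[(self.head + k) % n]
        return acc
```

Feeding a unit impulse through `CircularFir([1, 2, 3])` returns the coefficients in order, which is a quick check that the index arithmetic matches a conventional shifted delay line.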

A High-Security RSA Cryptoprocessor Embedded with an Efficient MAC Unit

  • Moon, Sang-Ook
    • Journal of information and communication convergence engineering / v.7 no.4 / pp.516-520 / 2009
  • RSA crypto-processors with key spaces of more than 1024 bits handle the entire key stream in units of blocks. The RSA processor targeted in this paper defines the basic word length as 128 bits and uses a 256-bit register as the accumulator. For efficient execution of 128-bit multiplication, a 32b*32b multiplier was designed and adopted, and the results are stored in 8 separate 128-bit registers according to the status flag. In this paper, an efficient method for executing the 128-bit MAC (multiplication and accumulation) operation is proposed. The method pre-analyzes all possible cases so that the MAC unit can skip unnecessary calculations and speed up execution. The prototype of the proposed MAC unit architecture was automatically synthesized and operated successfully at 20 MHz, the intended operating frequency of the RSA processor.
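
The idea of building a 128-bit MAC from a 32b*32b multiplier can be illustrated with schoolbook partial products accumulated into a 256-bit register. This sketch does not reproduce the paper's case pre-analysis or its register arrangement. Skipping zero limbs is only a hypothetical stand-in for "removing unnecessary calculations", and the names `limbs` and `mac128` are assumptions.

```python
LIMB_BITS = 32
LIMB_MASK = (1 << LIMB_BITS) - 1
ACC_MASK = (1 << 256) - 1  # models the 256-bit accumulator register


def limbs(x: int, count: int = 4) -> list[int]:
    """Split a 128-bit word into four 32-bit limbs, least significant first."""
    return [(x >> (LIMB_BITS * i)) & LIMB_MASK for i in range(count)]


def mac128(acc: int, a: int, b: int) -> int:
    """Return (acc + a*b) mod 2**256 using only 32x32-bit partial products."""
    for i, ai in enumerate(limbs(a)):
        if ai == 0:
            continue  # assumed shortcut: a zero limb contributes nothing
        for j, bj in enumerate(limbs(b)):
            if bj == 0:
                continue
            # Each 64-bit partial product is shifted to its weight and accumulated.
            acc += (ai * bj) << (LIMB_BITS * (i + j))
    return acc & ACC_MASK
```

With all limbs nonzero, a 128-bit by 128-bit product takes 16 partial products, so any case analysis that identifies skippable products directly reduces the multiplier's workload.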

A Design of 24-bit Floating Point MAC Unit for Transformation of 3D Graphics (3차원 그래픽의 트랜스포메이션을 위한 24-bit 부동 소수점 MAC 연산기의 설계)

  • Lee, Jungwoo;Kim, Woojin;Kim, Kichul
    • IEMEK Journal of Embedded Systems and Applications / v.4 no.1 / pp.1-8 / 2009
  • This paper proposes a 24-bit floating-point multiply and accumulate (MAC) unit that can be used in the geometry transformation process of 3D graphics. The MAC unit is composed of a floating-point multiplier and a floating-point accumulator. When a separate multiplier and accumulator are used, the matrix calculations in the transformation process cannot use continuously accumulated values. In the proposed MAC unit, the accumulator receives continuous input from the multiplier, and the calculation time is reduced. The MAC unit uses about 4,300 gates and can operate at 150 MHz.
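
To see why a geometry transformation maps naturally onto a MAC unit, note that each output coordinate of a 4x4 matrix transform is a chain of four multiply-accumulate steps. The sketch below shows that structure only. It uses Python's 64-bit floats rather than the paper's 24-bit format, and the function names are hypothetical.

```python
def mac(acc: float, a: float, b: float) -> float:
    """One multiply-accumulate step: the product feeds straight into the accumulator."""
    return acc + a * b


def transform(matrix, vertex):
    """Apply a 4x4 transformation matrix to a homogeneous vertex (x, y, z, w)."""
    out = []
    for row in matrix:
        acc = 0.0
        for m, v in zip(row, vertex):
            acc = mac(acc, m, v)  # four chained MACs per output coordinate
        out.append(acc)
    return out
```

For instance, a translation matrix with offsets (1, 2, 3) applied to the vertex (1, 1, 1, 1) yields (2, 3, 4, 1), using 16 MAC steps in total.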


A DSP Architecture for High-Speed FFT in OFDM Systems

  • Lee, Jae-Sung;Lee, Jeong-Hoo;SunWoo, Myung-H.;Moh, Sang-Man;Oh, Seong-Keun
    • ETRI Journal / v.24 no.5 / pp.391-397 / 2002
  • This paper presents digital signal processor (DSP) instructions and their data processing unit (DPU) architecture for high-speed fast Fourier transforms (FFTs) in orthogonal frequency division multiplexing (OFDM) systems. The proposed instructions jointly perform new operation flows that are more efficient than the operation flow of the multiply and accumulate (MAC) instruction on which existing DSP chips heavily depend. We further propose a DPU architecture that fully supports the instructions and show that the architecture is two times faster than existing DSP chips for FFTs. We simulated the proposed model in Verilog HDL, performed logic synthesis using a 0.35 µm standard cell library, and then verified the functions thoroughly.


Design of a RISC Processor with an Efficient Processing Unit for Multimedia Data (효율적인 멀티미디어데이터 처리를 위한 RISC Processor의 설계)

  • 조태헌;남기훈;김명환;이광엽
    • Proceedings of the IEEK Conference / 2003.07b / pp.867-870 / 2003
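  • This paper aims to design an efficient RISC processor unit for multimedia data processing. Based on the SIMD (Single Instruction Multiple Data) concept of vector processors, it improves the performance of the multiply-accumulate (MAC: Multiply and Accumulate) operation, the basis of multimedia data computation, by partially parallelizing operations on data whose bit width is small relative to the fixed data width of the arithmetic unit. In addition, in general-purpose processor extensions such as MMX and VIS, the reordering or packing/unpacking step, which repackages data of different lengths or narrow multimedia data into a single datum to obtain the data contiguity that partial parallelization requires as a precondition, degrades overall performance. This paper therefore reuses the arithmetic-unit structure of an existing processor to implement an arithmetic-unit structure for parallel multiplication and proposes a data alignment operation structure for it.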
