• Title/Summary/Keyword: Booth Algorithm

Search Result 45, Processing Time 0.027 seconds

A 8192-Point FFT Processor Based on the CORDIC Algorithm for OFDM System (CORDIC 알고리듬에 기반 한 OFDM 시스템용 8192-Point FFT 프로세서)

  • Park, Sang-Yoon;Cho, Nam-Ik
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.27 no.8B
    • /
    • pp.787-795
    • /
    • 2002
  • This paper presents the architecture and the implementation of a 2K/4K/8K-point complex Fast Fourier Transform(FFT) processor for Orthogonal Frequency-Division Multiplexing (OFDM) system. The architecture is based on the Cooley-Tukey algorithm for decomposing the long DFT into short length multi-dimensional DFTs. The transposition memory, shuffle memory, and memory mergence method are used for the efficient manipulation of data for multi-dimensional transforms. Booth algorithm and the COordinate Rotation DIgital Computer(CORDIC) processor are employed for the twiddle factor multiplications in each dimension. Also, for the CORDIC processor, a new twiddle factor generation method is proposed to obviate the ROM required for storing the twiddle factors. The overall 2K/4K/8K-FFT processor requires 600,000 gates, and it is implemented in 1.8 V, 0.18 ${\mu}m$ CMOS. The processor can perform 8K-point FFT in every 273 ${\mu}s$, 2K-point every 68.26 ${\mu}s$ at 30MHz, and the SNR is over 48dB, which are enough performances for the OFDM in DVB-T.

A Study On the Design of a Floating Point Unit for MPEG-2 AAC Decoder (MPEG-2 AAC 복호기를 위한 부동소수점유닛 설계에 관한 연구)

  • 구대성;김필중;김종빈
    • Journal of the Institute of Electronics Engineers of Korea TE
    • /
    • v.39 no.4
    • /
    • pp.355-355
    • /
    • 2002
  • In this paper, we designed a FPU(floating point unit) that it is very important and requires of high density when digital audio is designed. Almost audio system must support the multi-channel and required for high quality. A floating point arithmetic function in MPEG-2 AAC that implemented by hardware is able to realtime decoding when DSP realization. The reason is that MPEG-2 AAC is compatible to the Audio field of MPEG-4 and afterwards. We designed a FPU by hardware to increase the speed of a floating point unit with much calculation part in the MPEG-2 AAC Decoder. A FPU is composed of a multiplier and an adder. A multiplier used the Radix-4 Booth algorithm and an adder adopted 1's complement method for speed up. A form of a floating point unit has 8bit of exponent part and 24bit of mantissa. It's compatible with the IEEE single precision format and adopted a pipeline architecture to increase the speed of a processor. All of sub blocks are based on ISO/IEC 13818-7 standard. The algorithm is tested by C language and the design does by use of VHDL(VHSIC Hardware Description Language). The maximum operation speed is 23.2MHz and the stable operation speed is 19MHz.

Asynchronous 16bit Multiplier with micropipelined structure (마이크로파이프라인 구조의 16bit 비동기 곱셈기)

  • 장미숙;이유진;김학윤;이우석;최호용
    • Proceedings of the IEEK Conference
    • /
    • 2000.06b
    • /
    • pp.145-148
    • /
    • 2000
  • A 16bit asynchronous multiplier has been designed using micropipelind structure with 2 phase and data bundling. And 4-radix modified Booth algorithm, CPlatch(Cature-Pass latch) and modified 4-2 counters have adopted in this design. It is implemented in 0.65$\mu\textrm{m}$ double-poly/double-metal CMOS technology by using 12,074 transistors with core size of 1.4${\times}$1.8$\textrm{mm}^2$. And our design results in a computation rate 55MHz a supply voltage of 3.3V.

  • PDF

Design of an 8-bit Color Adjustor for SDTV Using Verilog HDL (Verilog HDL을 이용한 SDTV용 8bit 색상 보정기의 설계)

  • Jeon, Byoung-Woong;Song, In-Chae
    • Proceedings of the IEEK Conference
    • /
    • 2005.11a
    • /
    • pp.801-804
    • /
    • 2005
  • In this paper, we designed an 8-bit color adjustor for SDTV using Verilog HDL. The conversion block requires a lot of multiplication. So we adopted Booth algorithm to reduce amount of operation and processing time. To improve speed, we designed the system output as parallel structure. We synthesized the designed system using Xilinx ISE and verified the operation through simulation using Modelsim.

  • PDF

VLSI Implementation of High Speed Variable-Length RSA Crytosystem (가변길이 고속 RSA 암호시스템의 VLSI 구현)

  • 박진영;서영호;김동욱
    • Proceedings of the IEEK Conference
    • /
    • 2002.06b
    • /
    • pp.285-288
    • /
    • 2002
  • In this paper, a new structure of 1024-bit high-speed RSA cryptosystem has been proposed and implemented in hardware to increase the operation speed and enhance the variable-length operation in the plain text. The proposed algorithm applied a radix-4 Booth algorithm and CSA(Carry Save Adder) to the Montgomery algorithm for modular multiplication As the results from implementation, the clock period was approached to one delay of a full adder and the operation speed was 150MHz. The total amount of hardware was about 195k gates. The cryptosystem operates as the effective length of the inputted modulus number, which makes variable length encryption rather than the fixed-length one. Therefore, a high-speed variable-length RSA cryptosystem could be implemented.

  • PDF

A 32${\times}$32-b Multiplier Using a New Method to Reduce a Compression Level of Partial Products (부분곱 압축단을 줄인 32${\times}$32 비트 곱셈기)

  • 홍상민;김병민;정인호;조태원
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.40 no.6
    • /
    • pp.447-458
    • /
    • 2003
  • A high speed multiplier is essential basic building block for digital signal processors today. Typically iterative algorithms in Signal processing applications are realized which need a large number of multiply, add and accumulate operations. This paper describes a macro block of a parallel structured multiplier which has adopted a 32$\times$32-b regularly structured tree (RST). To improve the speed of the tree part, modified partial product generation method has been devised at architecture level. This reduces the 4 levels of compression stage to 3 levels, and propagation delay in Wallace tree structure by utilizing 4-2 compressor as well. Furthermore, this enables tree part to be combined with four modular block to construct a CSA tree (carry save adder tree). Therefore, combined with four modular block to construct a CSA tree (carry save adder tree). Therefore, multiplier architecture can be regularly laid out with same modules composed of Booth selectors, compressors and Modified Partial Product Generators (MPPG). At the circuit level new Booth selector with less transistors and encoder are proposed. The reduction in the number of transistors in Booth selector has a greater impact on the total transistor count. The transistor count of designed selector is 9 using PTL(Pass Transistor Logic). This reduces the transistor count by 50% as compared with that of the conventional one. The designed multiplier in 0.25${\mu}{\textrm}{m}$ technology, 2.5V, 1-poly and 5-metal CMOS process is simulated by Hspice and Epic. Delay is 4.2㎱ and average power consumes 1.81㎽/MHz. This result is far better than conventional multiplier with equal or better than the best one published.

New VLSI Architecture of Parallel Multiplier-Accumulator Based on Radix-2 Modified Booth Algorithm (Radix-2 MBA 기반 병렬 MAC의 VLSI 구조)

  • Seo, Young-Ho;Kim, Dong-Wook
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.45 no.4
    • /
    • pp.94-104
    • /
    • 2008
  • In this paper, we propose a new architecture of multiplier-and-accumulator (MAC) for high speed multiplication and accumulation arithmetic. By combining multiplication with accumulation and devising a hybrid type of carry save adder (CSA), the performance was improved. Since the accumulator which has the largest delay in MAC was removed and its function was included into CSA, the overall performance becomes to be elevated. The proposed CSA tree uses 1's complement-based radix-2 modified booth algorithm (MBA) and has the modified array for the sign extension in order to increase the bit density of operands. The CSA propagates the carries by the least significant bits of the partial products and generates the least significant bits in advance for decreasing the number of the input bits of the final adder. Also, the proposed MAC accumulates the intermediate results in the type of sum and carry bits not the output of the final adder for improving the performance by optimizing the efficiency of pipeline scheme. The proposed architecture was synthesized with $250{\mu}m,\;180{\mu}m,\;130{\mu}m$ and 90nm standard CMOS library after designing it. We analyzed the results such as hardware resource, delay, and pipeline which are based on the theoretical and experimental estimation. We used Sakurai's alpha power low for the delay modeling. The proposed MAC has the superior properties to the standard design in many ways and its performance is twice as much than the previous research in the similar clock frequency.

A Low-Complexity 128-Point Mixed-Radix FFT Processor for MB-OFDM UWB Systems

  • Cho, Sang-In;Kang, Kyu-Min
    • ETRI Journal
    • /
    • v.32 no.1
    • /
    • pp.1-10
    • /
    • 2010
  • In this paper, we present a fast Fourier transform (FFT) processor with four parallel data paths for multiband orthogonal frequency-division multiplexing ultra-wideband systems. The proposed 128-point FFT processor employs both a modified radix-$2^4$ algorithm and a radix-$2^3$ algorithm to significantly reduce the numbers of complex constant multipliers and complex booth multipliers. It also employs substructure-sharing multiplication units instead of constant multipliers to efficiently conduct multiplication operations with only addition and shift operations. The proposed FFT processor is implemented and tested using 0.18 ${\mu}m$ CMOS technology with a supply voltage of 1.8 V. The hardware- efficient 128-point FFT processor with four data streams can support a data processing rate of up to 1 Gsample/s while consuming 112 mW. The implementation results show that the proposed 128-point mixed-radix FFT architecture significantly reduces the hardware cost and power consumption in comparison to existing 128-point FFT architectures.

Implementation of Ray Tracing Processor for the Parallel Processing (병렬처리를 위한 고속 Ray Tracing 프로세서의 설계)

  • Choe, Gyu-Yeol;Jeong, Deok-Jin
    • The Transactions of the Korean Institute of Electrical Engineers A
    • /
    • v.48 no.5
    • /
    • pp.636-642
    • /
    • 1999
  • The synthesis of the 3D images is the most important part of the virtual reality. The ray tracing is the best method for reality in the 3D graphics. But the ray tracing requires long computation time for the synthesis of the 3D images. So, we implement the ray tracing with software and hardware. Specially we design the hit-test unit with FPGA tool for the ray tracing. Hit-test unit is a very important part of ray tracing to improve the speed. In this paper, we proposed a new hit-test algorithm and apply the parallel architecture for hit-test unit to improve the speed. We optimized the arithmetic unit because the critical path of hit-test unit is in the multiplication part. We used the booth algorithm and the baugh-wooley algorithm to reduce the partial product and adapted the CSA and CLA to improve the efficiency of the partial product addition. Our new Ray tracing processor can produce the image about 512ms/F and can be adapted to real-time application with only 10 parallel processors.

  • PDF

A 200-MHZ@2.5-V Dual-Mode Multiplier for Single / Double -Precision Multiplications (단정도/배정도 승산을 위한 200-MHZ@2.5-V 이중 모드 승산기)

  • 이종남;박종화;신경욱
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.4 no.5
    • /
    • pp.1143-1150
    • /
    • 2000
  • A dual-mode multiplier (DMM) that performs single- and double-precision multiplications has been designed using a $0.25-\mum$ 5-metal CMOS technology. An algorithm for efficiently implementing double-precision multiplication with a single-precision multiplier was proposed, which is based on partitioning double-precision multiplication into four single-precision sub-multiplications and computing them with sequential accumulations. When compared with conventional double-precision multipliers, our approach reduces the hardware complexity by about one third resulting in small silicon area and low-power dissipation at the expense of increased latency and throughput cycles. The DMM consists of a $28-b\times28-b$ single-precision multiplier designed using radix-4 Booth receding and redundant binary (RB) arithmetic, an accumulator and a simple control logic for mode selection. It contains about 25,000 transistors on the area of about $0.77\times0.40-m^2$. The HSPICE simulation results show that the DMM core can safely operate with 200-MHZ clock at 2.5-V, and its estimated power dissipation is about 130-㎽ at double-precision mode.

  • PDF