# High-throughput Low-complexity Mixed-radix FFT Processor using a Dual-path Shared Complex Constant Multiplier

Tram Thi Bao Nguyen and Hanho Lee

Abstract—This paper presents a high-throughput, low-complexity 512-point eight-parallel mixed-radix multipath delay feedback (MDF) fast Fourier transform (FFT) processor architecture for orthogonal frequency division multiplexing (OFDM) applications. To decrease the number of twiddle factor (TF) multiplications, a mixed-radix 2<sup>4</sup>/2<sup>3</sup> FFT algorithm is adopted. Moreover, a dual-path shared canonical signed digit (CSD) complex constant multiplier using a multi-layer scheme is proposed to reduce the hardware complexity of the TF multiplication. The proposed FFT processor is implemented using TSMC 90-nm CMOS technology. The synthesis results demonstrate that the proposed FFT processor achieves a 16% reduction in hardware complexity and higher throughput compared to conventional architectures.

Index Terms—Fast Fourier transform (FFT), mixed-radix, multipath delay feedback (MDF), dual-path, complex constant multiplier, orthogonal frequency division multiplexing (OFDM)

Manuscript received Dec. 23, 2015; accepted Dec. 25, 2016. Dept. of Information and Communication Engr., Inha University, Incheon, 22212, Korea. E-mail: hhlee@inha.ac.kr

## I. Introduction

The fast Fourier transform (FFT) algorithm is a fast and efficient method to compute the discrete Fourier transform (DFT). FFT processors are among the most widely used components in various digital signal processing (DSP) applications and systems. Moreover, FFT processors have attracted intense research interest with the appearance of OFDM communication systems, not only in digital broadcasting systems and mobile telecommunications but also in power line communication (PLC) systems [1-3].
In particular, with the increasing demand for multimedia applications using wireless transmission over short distances, the millimeter wave (mmWave) 60 GHz wireless personal area network (WPAN) has been intensively researched in recent years. Also, the IEEE 802.11 Task Group ad (IEEE 802.11ad) has developed a standard for mmWave wireless local area network (WLAN) and WPAN systems [4]. In the physical layer design of high-rate WPANs, OFDM modulation has been adopted, and the FFT processor has high hardware complexity. One OFDM symbol in the IEEE 802.11ad standard consists of 512 subcarriers. Therefore, the FFT processor performs a 512-point FFT computation. Because of this massive computational complexity, FFT processor architectures incur high hardware complexity and power consumption. Thus, several FFT processor architectures have been proposed to reduce the hardware complexity and to provide higher throughput.

Among various FFT algorithms, the Cooley-Tukey algorithm [5] is highly popular because it first introduced the concept of the FFT, which reduces the computational complexity by making efficient use of the symmetry and periodicity properties of the twiddle factors (TFs). To further reduce the computational complexity, several algorithms have been proposed, including radix-2<sup>3</sup> [6], radix-2<sup>4</sup> [7, 8], radix-2<sup>5</sup> [9], and mixed-radix [10]. In common, these higher-radix algorithms reduce the number of multiplications relative to the radix-2 algorithm. Other studies have addressed parallel FFT architectures, which can achieve higher throughput with lower computation latency [6-8]. However, some critical problems remain with respect to speed, area, and power consumption. Therefore, this paper focuses on improving the throughput and hardware complexity of FFT processor architectures.
In this paper, we propose a high-throughput and low-complexity 512-point eight-parallel multipath delay feedback (MDF) architecture using an area-efficient mixed-radix 2<sup>4</sup>/2<sup>3</sup> FFT algorithm. In addition, we propose the architecture of a dual-path shared multi-layer canonical signed digit (CSD) complex constant multiplier (DPS-MLCCM) to reduce the hardware complexity for TF multiplication of parallel FFT processors. The proposed FFT processor provides better throughput and less hardware complexity compared to previous designs [9-11].

The rest of this paper is organized as follows. Section II describes the mixed-radix $2^4/2^3$ FFT algorithm. In Section III, the architecture of the proposed mixed-radix FFT processor and the DPS-MLCCM for TF multiplication is presented. Section IV presents the implementation results and performance comparison. Finally, conclusions are provided in Section V.

## II. MIXED-RADIX 2<sup>4</sup>/2<sup>3</sup> FFT ALGORITHM

The DFT of length N is defined as

$$X(k) = \sum_{n=0}^{N-1} x(n) \cdot W_N^{nk}, \qquad 0 \le k \le N-1 \tag{1}$$

where k is the frequency index and n is the time index; the TF is defined as

$$W_{N}^{nk} = e^{-j2\pi \frac{nk}{N}} \tag{2}$$

The basic idea of the FFT is to decompose the computation of the *N*-point DFT into successively smaller DFTs [5].

### 1. Radix-2<sup>3</sup> Algorithm

To derive the radix-2<sup>3</sup> algorithm, the first three steps in cascade decomposition are considered.
The linear index mapping is transformed into a four-dimensional linear index map [6]:

$$\begin{aligned} n &= \left\langle \frac{N}{2} n_1 + \frac{N}{4} n_2 + \frac{N}{8} n_3 + n_4 \right\rangle_N \\ k &= \left\langle k_1 + 2k_2 + 4k_3 + 8k_4 \right\rangle_N \end{aligned} \tag{3}$$

Applying the four-dimensional linear index map to (1),

$$\begin{aligned} X\left(k_{1}+2k_{2}+4k_{3}+8k_{4}\right) &= \sum_{n_{4}=0}^{\frac{N}{8}-1}\sum_{n_{3}=0}^{1}\sum_{n_{2}=0}^{1}\sum_{n_{1}=0}^{1}x\left(\frac{N}{2}n_{1}+\frac{N}{4}n_{2}+\frac{N}{8}n_{3}+n_{4}\right)\cdot W_{N}^{nk} \\ &= \sum_{n_{4}=0}^{\frac{N}{8}-1}\left[L_{\frac{N}{8}}\left(n_{4},k_{1},k_{2},k_{3}\right)\cdot W_{N}^{n_{4}(k_{1}+2k_{2}+4k_{3})}\right]\cdot W_{\frac{N}{8}}^{n_{4}k_{4}} \end{aligned} \tag{4}$$

With the cascade decomposition, the TF can be expressed in the form of

$$\begin{aligned} &W_{N}^{(\frac{N}{2}n_{1}+\frac{N}{4}n_{2}+\frac{N}{8}n_{3}+n_{4})(k_{1}+2k_{2}+4k_{3}+8k_{4})} \\ &\quad= \underbrace{(-1)^{n_{1}k_{1}}}_{\text{Stage 1 BF}}\;\underbrace{(-j)^{n_{2}k_{1}}}_{\text{Stage 2 BF}}\;\underbrace{(-1)^{n_{2}k_{2}}}_{\text{Stage 2 BF}}\;\underbrace{W_{8}^{n_{3}(k_{1}+2k_{2})}}_{\text{Stage 3 BF}}\;\underbrace{(-1)^{n_{3}k_{3}}}_{\text{Stage 3 BF}}\; W_{N}^{n_{4}(k_{1}+2k_{2}+4k_{3})}\; W_{\frac{N}{8}}^{n_{4}k_{4}} \end{aligned} \tag{5}$$

### 2. Radix-2<sup>4</sup> Algorithm

In [7], the radix-2<sup>4</sup> algorithm is derived by considering the first four steps of decomposition.
Applying a five-dimensional linear index map,

$$\begin{aligned} n &= \left\langle \frac{N}{2} n_1 + \frac{N}{4} n_2 + \frac{N}{8} n_3 + \frac{N}{16} n_4 + n_5 \right\rangle_N \\ k &= \left\langle k_1 + 2k_2 + 4k_3 + 8k_4 + 16k_5 \right\rangle_N \end{aligned} \tag{6}$$

The common factor algorithm (CFA) takes the form of

$$\begin{aligned} X\left(k_{1}+2k_{2}+4k_{3}+8k_{4}+16k_{5}\right) &= \sum_{n_{5}=0}^{\frac{N}{16}-1}\sum_{n_{4}=0}^{1}\sum_{n_{3}=0}^{1}\sum_{n_{2}=0}^{1}\sum_{n_{1}=0}^{1}x\left(\frac{N}{2}n_{1}+\frac{N}{4}n_{2}+\frac{N}{8}n_{3}+\frac{N}{16}n_{4}+n_{5}\right)\cdot W_{N}^{nk} \\ &= \sum_{n_{5}=0}^{\frac{N}{16}-1}\left[G_{\frac{N}{16}}(n_{5},k_{1},k_{2},k_{3},k_{4})\cdot W_{N}^{n_{5}(k_{1}+2k_{2}+4k_{3}+8k_{4})}\right]\cdot W_{\frac{N}{16}}^{n_{5}k_{5}} \end{aligned} \tag{7}$$

With the cascade decomposition, the TF can be expressed in the form of

$$\begin{aligned} &W_{N}^{(\frac{N}{2}n_{1}+\frac{N}{4}n_{2}+\frac{N}{8}n_{3}+\frac{N}{16}n_{4}+n_{5})(k_{1}+2k_{2}+4k_{3}+8k_{4}+16k_{5})} \\ &\quad= \underbrace{(-1)^{n_{1}k_{1}}}_{\text{Stage 1 BF}}\;\underbrace{(-j)^{n_{2}k_{1}}}_{\text{Stage 2 BF}}\;\underbrace{(-1)^{n_{2}k_{2}}}_{\text{Stage 2 BF}}\;\underbrace{W_{16}^{(2n_{3}+n_{4})(k_{1}+2k_{2})}}_{\text{Stage 4 TF}} \\ &\qquad \underbrace{(-1)^{n_{3}k_{3}}}_{\text{Stage 3 BF}}\;\underbrace{(-j)^{n_{4}k_{3}}}_{\text{Stage 4 BF}}\;\underbrace{(-1)^{n_{4}k_{4}}}_{\text{Stage 4 BF}}\; W_{N}^{n_{5}(k_{1}+2k_{2}+4k_{3}+8k_{4})}\; W_{\frac{N}{16}}^{n_{5}k_{5}} \end{aligned} \tag{8}$$

### 3. Mixed-radix 2<sup>4</sup>/2<sup>3</sup> Algorithm

Many conventional FFT algorithms are variations of the Cooley-Tukey algorithm. An FFT architecture in which the radices of the component butterflies are not all equal is called a mixed-radix FFT. If the radix is properly chosen for different stages on the basis of the FFT specifications, an optimized design for speed and area can be obtained. Therefore, the proposed 512-point FFT processor adopts the mixed-radix $2^4/2^3$ FFT algorithm. Fig. 1 shows a signal flow graph for the 64-point mixed-radix $2^4/2^3$ algorithm.
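As a sanity check on the TF factorizations in (5) and (8), the following Python sketch compares both sides of each identity over every index combination. It is a minimal illustration and not part of the original design flow; the helper names `W`, `max_error_radix8`, and `max_error_radix16` are hypothetical, and `N` is only assumed to be a multiple of 16.

```python
import cmath
from itertools import product


def W(N, e):
    """Twiddle factor W_N^e = exp(-j*2*pi*e/N), as defined in (2)."""
    return cmath.exp(-2j * cmath.pi * e / N)


def max_error_radix8(N):
    """Largest |LHS - RHS| of the radix-2^3 factorization (5)."""
    err = 0.0
    for n1, n2, n3, k1, k2, k3 in product((0, 1), repeat=6):
        for n4, k4 in product(range(N // 8), repeat=2):
            n = N // 2 * n1 + N // 4 * n2 + N // 8 * n3 + n4
            k = k1 + 2 * k2 + 4 * k3 + 8 * k4
            rhs = ((-1) ** (n1 * k1) * (-1j) ** (n2 * k1) * (-1) ** (n2 * k2)
                   * W(8, n3 * (k1 + 2 * k2)) * (-1) ** (n3 * k3)
                   * W(N, n4 * (k1 + 2 * k2 + 4 * k3)) * W(N // 8, n4 * k4))
            err = max(err, abs(W(N, n * k) - rhs))
    return err


def max_error_radix16(N):
    """Largest |LHS - RHS| of the radix-2^4 factorization (8)."""
    err = 0.0
    for n1, n2, n3, n4, k1, k2, k3, k4 in product((0, 1), repeat=8):
        for n5, k5 in product(range(N // 16), repeat=2):
            n = (N // 2 * n1 + N // 4 * n2 + N // 8 * n3
                 + N // 16 * n4 + n5)
            k = k1 + 2 * k2 + 4 * k3 + 8 * k4 + 16 * k5
            rhs = ((-1) ** (n1 * k1) * (-1j) ** (n2 * k1) * (-1) ** (n2 * k2)
                   * W(16, (2 * n3 + n4) * (k1 + 2 * k2))
                   * (-1) ** (n3 * k3) * (-1j) ** (n4 * k3) * (-1) ** (n4 * k4)
                   * W(N, n5 * (k1 + 2 * k2 + 4 * k3 + 8 * k4))
                   * W(N // 16, n5 * k5))
            err = max(err, abs(W(N, n * k) - rhs))
    return err


if __name__ == "__main__":
    # N = 512 matches the FFT size considered in this paper.
    print("radix-2^3 max error:", max_error_radix8(512))
    print("radix-2^4 max error:", max_error_radix16(512))
```

Both maxima are expected to sit at floating-point rounding level, which indicates that the trivial butterfly factors (±1 and -j) together with the residual TFs reproduce $W_N^{nk}$.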
The mixed-radix $2^4/2^3$ algorithm for 512-point FFT computation consists of nine stages, in which the radix-$2^4$ algorithm is adopted in the first four stages and the radix-$2^3$ algorithm is used in the remaining stages. Table 1 presents the sequence of the 512-point FFT TF computation at each stage for the radix-$2^k$ and mixed-radix algorithms. As can be observed, the mixed-radix $2^4/2^3$ algorithm can reduce the number of TF multiplications. Hence, the area and power consumption of the complex multipliers can be reduced accordingly.

**Fig. 1.** Signal flow graph of the 64-point mixed-radix $2^4/2^3$ algorithm.

**Table 1.** Sequence of the 512-point FFT TF computation for radix-$2^k$ and the mixed-radix algorithm

| Algorithm | Stage 1 | Stage 2 | Stage 3 | Stage 4 | Stage 5 | Stage 6 | Stage 7 | Stage 8 |
|---|---|---|---|---|---|---|---|---|
| Radix-2<sup>2</sup> | -j | $W_{512}$ | -j | $W_{128}$ | -j | $W_{32}$ | -j | $W_8$ |
| Radix-2<sup>3</sup> | -j | $W_8$ | $W_{512}$ | -j | $W_8$ | $W_{64}$ | -j | $W_8$ |
| Radix-2<sup>4</sup> | -j | $W_{16}$ | -j | $W_{512}$ | -j | $W_{16}$ | -j | $W_{32}$ |
| Radix-2<sup>5</sup> | -j | $W_8$ | $W_{32}$ | -j | $W_{512}$ | -j | $W_8$ | $W_{32}$ |
| Mixed-radix 2<sup>4</sup>/2<sup>3</sup> | -j | $W_{16}$ | -j | $W_{512}$ | -j | $W_8$ | $W_{32}$ | -j |

## III. PROPOSED FFT ARCHITECTURE

### 1. Proposed MDF Architecture

Several pipeline architectures for FFT processors have been proposed over the past few decades [6]. In general, the multi-path delay commutator (MDC) scheme can achieve a higher throughput rate, while the single-path delay feedback (SDF) scheme requires less memory and hardware complexity [12]. The proposed MDF architecture can provide a high throughput rate with minimal hardware cost by incorporating the features of MDC and SDF [12-14].
The block diagram of the proposed eight-parallel 512-point mixed-radix $2^4/2^3$ MDF FFT processor architecture is presented in Fig. 2. The architecture contains nine stages with the following sub-blocks: the IFFT/FFT select unit, based on the duality between the IFFT and the FFT; module 1, based on the radix-$2^4$ FFT algorithm for stage 1 to stage 4; module 2, which consists of stage 5 to stage 9 relying on the radix-$2^3$ FFT algorithm; and the top control unit. The details of each module are discussed as follows.

**Fig. 2.** Block diagram of the proposed 512-point mixed-radix $2^4/2^3$ MDF FFT processor.

**Fig. 3.** Block diagram of Module 1.

Module 1 has a radix-$2^4$ structure as depicted in Fig. 3. The input data is processed in eight parallel data-paths. Module 1 covers stage 1 to stage 4 of the proposed FFT processor and consists of the first-in first-out (FIFO) registers, two types of butterfly unit (BF1 and BF2), the conventional CSD complex constant multiplier (CCM) using a common sub-expression sharing (CSS) technique for stage 2, and the proposed DPS-MLCCM using the CSS technique for TF $W_{512}$ in stage 4. The BF1 only implements complex additions and subtractions, while the BF2 also includes the TF $W_4$ multiplication, realized with multiplexers and control signals [9].

Module 2 is realized by the radix-$2^3$ FFT algorithm as shown in Fig. 4. Most of the components of module 2 are similar to those of module 1: FIFO registers, butterfly units BF1 and BF2, the CCM using the CSS technique for stage 6, and a dual-path shared CCM (DPS-CCM) using the CSS technique for TF $W_{32}$ in stage 7.

**Fig. 4.** Block diagram of Module 2.

### 2. Proposed Dual-path Shared Multi-layer CSD Complex Constant Multiplier

A complex multiplier is a component that has a critical effect on the hardware complexity, power consumption, and throughput of FFT processors.
Even when a low-complexity FFT algorithm is adopted, the complex multiplier can be realized by various approaches to further reduce hardware complexity. Generally, the CSD representation of a TF reduces hardware complexity more than the binary representation when the TF has only a few coefficients. For a TF with many coefficients, the Booth multiplier has been widely used in existing research. However, its primary problem is that it requires high hardware complexity.

In this paper, a novel DPS-MLCCM architecture for TF $W_{512}$ multiplication is presented. First, we consider the multiplications of the input $x\{Re,Im\}$ with the TF $W_{512}^t = e^{-j2\pi t/512} = X_t + jY_t$ (where t ranges from 0 to 465), which can be divided into eight regions as illustrated in Fig. 5. From the symmetry property, the TFs in the other regions can be obtained by mapping from region A. The value of t is mapped to t', i.e., $W_{512}^{t'} = X_{t'} + jY_{t'}$, where t' ranges from 0 to 64 in region A.

**Fig. 5.** TFs with the corresponding mapping in eight regions of the unit circle for DPS-MLCCM.

**Fig. 6.** Block diagram of DPS-MLCCM for $W_{512}$ TF multiplication.

Second, the TF $W_{512}^{t'}$ is decomposed so that the multiplication can be implemented by a multi-layer CCM. The decomposition is applied to reduce the number of CSD constant multipliers needed. The multiplication of input $x\{Re,Im\}$ with the TF $W_{512}^{t'} = X_{t'} + jY_{t'}$ is now derived as
$$\begin{aligned} x \times W_{512}^{t'} &= x \times W_{512}^{(8t_1+t_2)} = x \times W_{512}^{8t_1} \times W_{512}^{t_2}, \qquad \text{where } \begin{cases} 0 \le t_1 \le 8 \\ 0 \le t_2 \le 7 \end{cases} \\ &= \underbrace{[\mathrm{Re}\{x\} + j\,\mathrm{Im}\{x\}] \times W_{512}^{8t_1}}_{\text{Layer 1}} \times W_{512}^{t_2} \\ &= \underbrace{[\mathrm{Re}\{Out_{Layer1}\} + j\,\mathrm{Im}\{Out_{Layer1}\}] \times W_{512}^{t_2}}_{\text{Layer 2}} \end{aligned} \tag{9}$$

Through the TF mapping and decomposition process, the entire multiplication unit can be implemented using 15 different TFs in total. The numbers of TFs needed for layer 1 and layer 2 of the proposed DPS-MLCCM are eight and seven, respectively. Fig. 6 presents the block diagram of the proposed DPS-MLCCM for $W_{512}$ TF multiplication. The input data, together with the mapped TF coefficient and the regional selection data generated from the TF controller block, pass through two layers of the multiplier to generate the output result. On the basis of the TF scheduling in layer 2 of the DPS-MLCCM, the path sharing scheme can be applied in this layer for further hardware complexity reduction.

**Fig. 7.** Architecture of the pre-CCM in DPS-MLCCM layer 1 for $W_{512}$ TF multiplication.

Layer 1 of the multiplier is responsible for the complex multiplication of input x with a TF $W_{512}^{8t_1}$, $0 \le t_1 \le 8$. Since the number of TFs is quite small, the multiplication in layer 1 can be implemented using the pre-CCM. When the TF is equal to $W_{512}^0$, the pre-CCM performs a bypass operation without an additional CCM. Therefore, this layer consists of a CCM using only 16 coefficients. In addition, the CSS technique is applied to these coefficients to minimize hardware complexity. The detailed hardware architecture of the pre-CCM of the proposed DPS-MLCCM for $W_{512}$ TF multiplication is shown in Fig. 7. This pre-CCM of layer 1 consists of the CCM using the CSS technique, the two's complement logic, and the multiplexers for proper control.
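The region mapping and the two-layer split in (9) can be sketched behaviorally in Python as follows. This is a floating-point illustration under stated assumptions, not the authors' hardware: the octant convention in `region_map` is a generic symmetry argument and is not taken from Fig. 5, and the names `region_map`, `mlccm_multiply`, and `nontrivial_constants` are hypothetical.

```python
import cmath

N = 512  # FFT size; W(e) below denotes the TF W_512^e


def W(e):
    """Twiddle factor W_512^e = exp(-j*2*pi*e/512)."""
    return cmath.exp(-2j * cmath.pi * e / N)


def region_map(t):
    """Map exponent t to (t_prime, quadrant, mirrored), 0 <= t_prime <= 64.

    Assumed convention: each quadrant of 128 exponents is split in two,
    and the upper half is mirrored back onto region A.
    """
    quadrant, u = divmod(t % N, N // 4)
    mirrored = u > N // 8
    t_prime = N // 4 - u if mirrored else u
    return t_prime, quadrant, mirrored


def mlccm_multiply(x, t):
    """Compute x * W_512^t using only region-A constants and two layers."""
    t_prime, quadrant, mirrored = region_map(t)
    t1, t2 = divmod(t_prime, 8)            # t' = 8*t1 + t2, as in (9)
    v = x.conjugate() if mirrored else x
    out_layer1 = v * W(8 * t1)             # layer 1 (bypass when t1 == 0)
    out_layer2 = out_layer1 * W(t2)        # layer 2 (bypass when t2 == 0)
    if mirrored:
        # W^(128 - t') = -j * conj(W^t'): only swaps and sign changes
        out_layer2 = -1j * out_layer2.conjugate()
    return out_layer2 * (-1j) ** quadrant  # W^(128*q) = (-j)^q


def nontrivial_constants(t_max=465):
    """Collect the non-trivial exponents each layer needs for t = 0..t_max."""
    layer1, layer2 = set(), set()
    for t in range(t_max + 1):
        t1, t2 = divmod(region_map(t)[0], 8)
        if t1:
            layer1.add(8 * t1)
        if t2:
            layer2.add(t2)
    return layer1, layer2


if __name__ == "__main__":
    layer1, layer2 = nontrivial_constants()
    print(len(layer1), len(layer2))  # 8 7
```

Under these assumptions, the sketch needs eight non-trivial constants in layer 1 and seven in layer 2, which agrees with the 15 TFs stated in the text; the region remapping itself reduces to conjugations and multiplications by powers of -j.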
The output of layer 1 is then sent to the input of the DPS-CCM in layer 2 of the DPS-MLCCM for the remaining computations.

Layer 2 of the DPS-MLCCM is responsible for the multiplication of the input data set $Out_{L1}\{Re, Im\}$, which is the output from layer 1 of the multiplier, with the TFs $W_{512}^{t_2}$, $0 \le t_2 \le 7$. When the TF is equal to $W_{512}^0$, no multiplier is required for computation, as described for layer 1. Therefore, this layer consists of only seven complex multipliers. Similarly, the TF multiplication in layer 2 can be implemented using 14 CCMs. On the basis of the TF scheduling in the eight parallel data-paths, the proposed dual-path sharing technique can be applied in layer 2 of the DPS-MLCCM to further reduce the hardware complexity. Fig. 8 shows the corresponding values of TF $W_{512}^{t_2}$ for the eight data-paths at different time slots. According to this scheduling in the proposed 512-point mixed-radix FFT processor, the same TF never occurs at the same time in both data-paths of any of the following pairs: path 1 and path 6, path 2 and path 7, path 3 and path 8, and path 4 and path 5. However, the CCM used for $t_2 = 4$ must be duplicated in order to avoid a conflict in the DPS-CCM control. Consequently, a total of 16 CCMs are required in the DPS-CCM architecture. The detailed DPS-CCM architecture of layer 2 of the proposed DPS-MLCCM for $W_{512}$ TF multiplication is shown in Fig. 9. This layer consists of the CCM using the sharing technique, two's complement logic, and the multiplexers for appropriate control of regional remapping.

**Fig. 8.** Scheduling of the TF at each time slot.

**Fig. 9.** Architecture of DPS-CCM in DPS-MLCCM layer 2 for $W_{512}$ TF multiplication.

## IV. IMPLEMENTATION RESULTS AND PERFORMANCE COMPARISON

Prior to the hardware implementation of the proposed FFT processor, an appropriate word length was determined and the quantization error performance was evaluated by a fixed-point simulation using MATLAB. From the simulation results, a 12-bit word length was chosen for both the real and imaginary parts because the output signal-to-noise ratio (SNR) saturated at a 12-bit word length. The chosen word length keeps the quantization noise low while minimizing the hardware complexity. When the word length is set to 12 bits, the proposed FFT processor architecture yields a signal-to-quantization-noise ratio (SQNR) of 41.2 dB without using a data scaling approach.

After a proper word length was determined, the proposed FFT architecture was designed at the register transfer level (RTL) using Verilog HDL and functionally verified using a commercial Verilog HDL simulator. In addition, the entire design was synthesized using Synopsys Design Compiler with a TSMC 90-nm CMOS technology optimized for a 1.2-V supply voltage. The proposed processor has a gate count of 243,000, and the operating clock frequency is 385 MHz.

Table 2 presents a comparison of the hardware complexity between different 512-point FFT processor architectures. In order to compare the hardware complexity, the complex multipliers were synthesized and then the area of each multiplier was normalized. Compared with other architectures, the proposed architecture has the lowest total normalized area of the complex multipliers. In addition, there is no need to allocate memory to store the twiddle factors.

**Table 2.** Hardware complexity comparison of different 512-point FFT processors

| | [9] | [10] | [11] | Proposed |
|---|---|---|---|---|
| FFT size | 512 | 512 | 512 | 512 |
| No. of CR | 504 | 504 | 640 | 504 |
| No. of CA | 120 | 120 | 112 | 120 |
| No. of CBM (N.A<sup>g</sup> = 1) | 8 | 8 × (0.76 + 0.57) | 16 × 0.76 | - |
| No. of CCM ($W_8$) (N.A<sup>g</sup> = 0.12) | 8 | 9 | - | 8 |
| No. of CCM ($W_{16}$) (N.A<sup>g</sup> = 0.28) | 8 | 8 | 12 | 8 |
| No. of CCM ($W_{32}$) (N.A<sup>g</sup> = 0.46) | 8 | - | - | - |
| No. of DPS-CCM ($W_{32}$) (N.A = 0.43) | - | - | - | 3 |
| No. of DPS-MLCCM ($W_{512}$) (N.A = 1.9) | - | - | - | 4 |
| Total N.A of complex mult. | 14.88 | 13.96 | 15.52 | 12.09 |
| TF LUT size (bits) | 10,240 | - | - | - |

CR denotes the complex registers. CA denotes the complex adders. CBM denotes the complex Booth multiplier. CCM denotes the complex constant multiplier. DPS-CCM denotes the dual-path shared CCM. DPS-MLCCM denotes the dual-path shared multi-layer CCM. N.A denotes the normalized area.

Table 3 shows the performance comparisons between the proposed 512-point eight-parallel mixed-radix $2^4/2^3$ MDF FFT/IFFT processor using CCM and several existing 512-point FFT processors [9-11]. The results show that the proposed FFT processor obtains better SQNR performance than that of [9]. The design in [9] is also a pipelined eight-parallel MDF architecture for a 512-point FFT processor. However, it requires more complex-multiplier area and more memory than the proposed design. Therefore, the proposed architecture results in low hardware complexity. Moreover, the throughput of the proposed FFT processor reaches 3.08 GS/s at 385 MHz by employing eight parallel data-paths. This throughput rate is the highest among the designs listed in Table 3.
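The totals in Table 2 and the quoted throughput can be reproduced with a few lines of arithmetic. The Python sketch below is only a cross-check of the published figures: the multiplier counts and normalized areas are copied from Table 2, the throughput formula assumes one sample per data-path per clock cycle, and the names `DESIGNS`, `total_area`, and `throughput_gsps` are hypothetical.

```python
# (count, normalized area) pairs per complex-multiplier type, from Table 2.
DESIGNS = {
    "[9]": [(8, 1.0), (8, 0.12), (8, 0.28), (8, 0.46)],
    "[10]": [(8, 0.76 + 0.57), (9, 0.12), (8, 0.28)],
    "[11]": [(16, 0.76), (12, 0.28)],
    "Proposed": [(8, 0.12), (8, 0.28), (3, 0.43), (4, 1.9)],
}


def total_area(entries):
    """Sum of count * normalized area over all multiplier types."""
    return sum(count * area for count, area in entries)


def throughput_gsps(paths, clock_mhz):
    """Throughput in GS/s, assuming one sample per path per clock cycle."""
    return paths * clock_mhz / 1000.0


if __name__ == "__main__":
    for name, entries in DESIGNS.items():
        print(name, round(total_area(entries), 2))
    print("Proposed throughput (GS/s):", throughput_gsps(8, 385))
```

With these inputs the totals come out as 14.88, 13.96, 15.52, and 12.09, matching the "Total N.A of complex mult." row, and eight data-paths at 385 MHz give 3.08 GS/s.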
Finally, it is evident from Table 3 that the proposed architecture has an advantage in terms of hardware complexity; the gate count of the overall FFT processor is reduced by more than 16% compared to that of [9]. This hardware complexity reduction results from using the proposed multi-layer CSD complex multiplier architecture and the sharing technique.

**Table 3.** Performance of the proposed FFT processor compared with previous implementations

| | [9] | [10] | [11] | Proposed |
|---|---|---|---|---|
| Technology | 90-nm | 90-nm | 90-nm | 90-nm |
| Architecture | MDF | MDF | Memory-based | MDF |
| FFT size | 512 | 512 | 512 | 512 |
| No. of data-paths | 8 | 8 | 16 | 8 |
| Algorithm | Modified radix-2<sup>5</sup> | Mixed-radix $2^4/2^2/2^3$ | Radix-16 | Mixed-radix 2<sup>4</sup>/2<sup>3</sup> |
| Internal word length (bits) | 12 | 14 | 12 | 12 |
| SQNR (dB) | 35 | - | 57 | 41.2 |
| Measurement | Logic synthesis | Post-layout | Post-layout | Logic synthesis |
| Clock rate (MHz) | 310 | 220 | 324 | 385 |
| Throughput (GS/s) | 2.5 | 1.76 | 2.59 | 3.08 |
| Gate count | 290K | - | - | 243K |

## V. CONCLUSIONS

This paper presents a 512-point eight-parallel MDF mixed-radix 2<sup>4</sup>/2<sup>3</sup> FFT processor using a novel DPS-MLCCM. In particular, a dual-path sharing technique and a multi-layer CCM architecture for TF multiplication are proposed to efficiently reduce the hardware complexity of the FFT processor based on the parallel MDF architecture. From fixed-point simulation results, the SQNR is 41.2 dB with a 12-bit word-length implementation. The synthesis results give an estimated total of 243,000 NAND gates, and the throughput is 3.08 GS/s for the proposed FFT processor.
The proposed FFT processor is the most area-efficient and highest-throughput architecture among the compared eight-parallel 512-point MDF FFT processors. Therefore, the proposed FFT processor is a promising solution for OFDM systems that require high throughput and low complexity.

## ACKNOWLEDGMENTS

This work was supported by an Inha University Research Grant.

## REFERENCES

- [1] U. Reimers, "Digital video broadcasting," *IEEE Communications Magazine*, Vol. 36, No. 6, pp. 104-110, Jun. 1998.
- [2] G. Faria, J. Henriksson, E. Stare, and P. Talmola, "DVB-H: digital broadcast services to handheld devices," *Proceedings of the IEEE*, Vol. 94, No. 1, pp. 194-209, Jan. 2006.
- [3] T. Zhang and W. Liu, "FFT-based OFDM in broadband-PLC and narrowband," *Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC), 2012 International Conference*, pp. 473-478, Oct. 2012.
- [4] [Online]. Available: http://www.ieee802.org/11/Reports/tgad\_update.htm
- [5] J. Cooley and J. Tukey, "An algorithm for the machine calculation of complex Fourier series," *Mathematics of Computation*, Vol. 19, No. 90, pp. 297-301, Apr. 1965.
- [6] S. He and M. Torkelson, "Designing pipeline FFT processor for OFDM (de)modulation," *URSI International Symposium on Signals, Systems, and Electronics (ISSSE 98)*, pp. 257-262, Oct. 1998.
- [7] J. Lee and H. Lee, "A high-speed two-parallel radix-2<sup>4</sup> FFT/IFFT processor for MB-OFDM UWB systems," *IEICE Trans. on Fundamentals of Electronics, Communications and Computer Sciences*, Vol. E91-A, No. 4, pp. 1206-1211, Apr. 2008.
- [8] M. Garrido, J. Grajal, M. Sanchez, and O. Gustafsson, "Pipelined radix-2<sup>k</sup> feed forward FFT architectures," *IEEE Trans. on Very Large Scale Integration (VLSI) Systems*, Vol. 21, No. 1, pp. 23-32, Jan. 2013.
- [9] T. Cho and H. Lee, "A high-speed low-complexity modified radix-2<sup>5</sup> FFT processor for high-rate WPAN applications," *IEEE Trans. on Very Large Scale Integration (VLSI) Systems*, Vol. 21, No. 1, pp. 187-191, Jan. 2013.
- [10] C. Wang, Y. Yan, and X. Fu, "A high-throughput low-complexity radix-2<sup>4</sup>-2<sup>2</sup>-2<sup>3</sup> FFT/IFFT processor with parallel and normal input/output order for IEEE 802.11ad Systems," *IEEE Trans. on Very Large Scale Integration (VLSI) Systems*, Vol. 23, No. 11, pp. 2728-2732, Nov. 2015.
- [11] S. Huang and S. Chen, "A high-throughput radix-16 FFT processor with parallel and normal input/output ordering for IEEE 802.15.3c systems," *IEEE Trans. on Circuits and Systems I: Regular Papers*, Vol. 59, No. 8, pp. 1752-1765, Aug. 2012.
- [12] Y. Lin, H. Liu, and C. Lee, "A 1-GS/s FFT/IFFT processor for UWB applications," *IEEE J. Solid-State Circuits*, Vol. 40, No. 8, pp. 1726-1735, Aug. 2005.
- [13] S. Tang, J. Tsai, and T. Chang, "A 2.4-GS/s FFT processor for OFDM based WPAN applications," *IEEE Trans. on Circuits and Systems II: Express Briefs*, Vol. 57, No. 6, pp. 451-455, Jun. 2010.
- [14] M. Shin and H. Lee, "A high-speed, four-parallel radix-2<sup>4</sup> FFT processor for UWB applications," in *Proc. IEEE International Symposium on Circuits and Systems (ISCAS)*, pp. 960-963, May 2008.

Tram Thi Bao Nguyen received the B.S. degree in Electronic & Telecommunication Engineering from Danang University of Science and Technology, Vietnam, in 2011 and the M.S. degree in Information & Communication Engineering from Inha University, Korea, in 2016. She is currently working toward the Ph.D. degree in Information & Communication Engineering at Inha University, Korea. Her research interests include VLSI and SoC architecture design for digital signal processing and communication systems.

Hanho Lee received Ph.D. and M.S. degrees, both in Electrical & Computer Engineering, from the University of Minnesota, Minneapolis, in 2000 and 1996, respectively. In 1999, he was a Member of Technical Staff-1 at Lucent Technologies, Bell Labs, Holmdel, New Jersey.
From April 2000 to August 2002, he was a Member of Technical Staff at Lucent Technologies (Bell Labs Innovations), Allentown. From August 2002 to August 2004, he was an Assistant Professor at the Department of Electrical and Computer Engineering, University of Connecticut, USA. Since August 2004, he has been with the Department of Information and Communication Engineering, Inha University, where he is currently a Professor. He was a visiting researcher at the Electronics and Telecommunications Research Institute (ETRI), Korea, in 2005. From August 2010 to August 2011, he was a visiting scholar at Bell Labs, Alcatel-Lucent, Murray Hill, New Jersey, USA. His research interests include VLSI architecture design for digital signal processing, forward error correction architectures, cryptographic systems, and communications.