• Title/Summary/Keyword: Efficient implementation

Efficient Parallel Block-layered Nonbinary Quasi-cyclic Low-density Parity-check Decoding on a GPU

  • Thi, Huyen Pham;Lee, Hanho
    • IEIE Transactions on Smart Processing and Computing / v.6 no.3 / pp.210-219 / 2017
  • This paper proposes a modified min-max algorithm (MMMA) for nonbinary quasi-cyclic low-density parity-check (NB-QC-LDPC) codes and an efficient parallel block-layered decoder architecture corresponding to the algorithm on a graphics processing unit (GPU) platform. The algorithm removes multiplications over the Galois field (GF) in the merger step to reduce decoding latency without any performance loss. The decoding implementation on a GPU for NB-QC-LDPC codes achieves improvements in both flexibility and scalability. To perform the decoding on the GPU, data and memory structures suitable for parallel computing are designed. The implementation results for NB-QC-LDPC codes over GF(32) and GF(64) demonstrate that the parallel block-layered decoding on a GPU accelerates the decoding process to provide a faster decoding runtime, and obtains a higher coding gain under a low $10^{-10}$ bit error rate and low $10^{-7}$ frame error rate, compared to existing methods.

Flexible Prime-Field Genus 2 Hyperelliptic Curve Cryptography Processor with Low Power Consumption and Uniform Power Draw

  • Ahmadi, Hamid-Reza;Afzali-Kusha, Ali;Pedram, Massoud;Mosaffa, Mahdi
    • ETRI Journal / v.37 no.1 / pp.107-117 / 2015
  • This paper presents an energy-efficient (low power) prime-field hyperelliptic curve cryptography (HECC) processor with uniform power draw. The HECC processor performs divisor scalar multiplication on the Jacobian of genus 2 hyperelliptic curves defined over prime fields for arbitrary field and curve parameters. It supports the most frequent case of divisor doubling and addition. The optimized implementation, which is synthesized in a $0.13{\mu}m$ standard CMOS technology, performs an 81-bit divisor multiplication in 503 ms consuming only $6.55{\mu}J$ of energy (average power consumption is $12.76{\mu}W$). In addition, we present a technique to make the power consumption of the HECC processor more uniform and lower the peaks of its power consumption.

Parallel Implementation of Radon Transform on TMS320C80-based System (TMS320C80시스템에서 Radon 변환의 병렬 구현)

  • 송정호;성효경;최흥문
    • Proceedings of the IEEK Conference / 1998.10a / pp.727-730 / 1998
  • In this paper, we propose an efficient parallel implementation of the Radon transform on a TMS320C80-based system. For an $N \times N$ SAR image, we obtain O(NM/p) complexity, compared with the conventional parallel Radon transform, by representing the projection patterns in Radon space variables instead of image space variables and by pipelining the algorithm, where p is the number of processors and M is the number of projection angles. We also reduce the time for dynamic load distribution among the nodes and the communication overhead of accessing the global memories by pipelining the memory and processing operations with a triple-buffer structure. Experimental results show an efficient parallel Radon transform with speedup Sp=3.9 and efficiency E=97.5% for a $256 \times 256$ image when implemented on a TMS320C80 composed of four parallel slave processors with three memory blocks.
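
The reported efficiency is consistent with the usual definition of parallel efficiency as speedup divided by processor count. The abstract does not state the definition, so this is an assumption; with $p = 4$ slave processors:

$$
E = \frac{S_p}{p} = \frac{3.9}{4} = 0.975 = 97.5\%
$$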

Efficient ATM-PSTN per-trunk interworking (효율적인 ATM-PSTN trunk간의 연동 방안)

  • 이광희;이성창
    • Journal of the Korean Institute of Telematics and Electronics S / v.35S no.2 / pp.40-49 / 1998
  • In this paper, we propose an efficient per-trunk interworking mechanism between the PSTN and an ATM network, assuming that the ATM network interworks with the PSTN during the evolution period. The proposed mechanism improves cell payload utilization by mapping only the active channels of a PSTN frame into the ATM cell payload. We also propose a frame recovery mechanism to guarantee frame sequence integrity. The proposed mechanism is compared with other possible ones in terms of cell payload utilization and cell packetization delay. We present the implementation structure of the interworking unit. The correctness of the mechanism and the feasibility of the implementation are verified through CAD simulation.
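
The idea of carrying only active channels can be sketched as follows. This is a minimal illustration, not the paper's cell format: the abstract does not describe the layout, so the activity bitmap, the function names, and the idle fill byte are all assumptions.

```python
def pack_active_channels(frame, active):
    """Keep only the samples of active channels.

    frame  : bytes, one sample per trunk channel (hypothetical layout)
    active : list of bool, True where the channel carries a call
    Returns (bitmap, payload); bit i of bitmap marks channel i as active.
    """
    if len(frame) != len(active):
        raise ValueError("frame and activity list must have equal length")
    bitmap = 0
    payload = bytearray()
    for idx, (sample, is_active) in enumerate(zip(frame, active)):
        if is_active:
            bitmap |= 1 << idx
            payload.append(sample)
    return bitmap, bytes(payload)


def unpack_active_channels(bitmap, payload, n_channels, idle=0xFF):
    """Rebuild a full frame; inactive channels get the assumed idle byte."""
    frame = bytearray([idle]) * n_channels
    samples = iter(payload)
    for idx in range(n_channels):
        if (bitmap >> idx) & 1:
            frame[idx] = next(samples)
    return bytes(frame)
```

With three of eight channels active, the payload shrinks from eight bytes to three plus the bitmap, which is the kind of saving that lets more frames share one 48-byte ATM cell payload.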

Implementation of Efficient Exponential Function Approximation Algorithm Using Format Converter Based on Floating Point Operation in FPGA (부동소수점 기반의 포맷 컨버터를 이용한 효율적인 지수 함수 근사화 알고리즘의 FPGA 구현)

  • Kim, Jeong-Seob;Jung, Seul
    • Journal of Institute of Control, Robotics and Systems / v.15 no.11 / pp.1137-1143 / 2009
  • This paper presents an FPGA implementation of efficient algorithms for approximating the exponential function on floating-point format data. The Taylor-Maclaurin expansion, a conventional approximation method, becomes inefficient because large arguments require a high-order expansion to satisfy the approximation error. A format converter is designed to convert the fixed-point data format to a floating-point data format, and the real number is then separated into two fields, an integer field and an exponent field, so that mathematical operations can be performed on them separately. A new assembly command is designed and added to the previously developed command set to reference the math table. To test the proposed algorithm, an assembly program has been developed. The program is downloaded onto the Altera DSP KIT W/STRATIX II EP2S180N board. The performance of the proposed method is compared with that of the Taylor-Maclaurin expansion.
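
The cost of a plain Taylor-Maclaurin expansion for large arguments, and the benefit of splitting the argument and using a lookup table, can be illustrated in software. This sketch is not the paper's format converter or instruction set: the integer/fraction split, the table range, and all names are assumptions chosen for illustration.

```python
import math


def taylor_exp(x, terms):
    """Sum the first `terms` terms of the Maclaurin series of exp(x)."""
    total, term = 1.0, 1.0
    for k in range(1, terms):
        term *= x / k
        total += term
    return total


def terms_needed(x, rel_tol=1e-6, max_terms=200):
    """Smallest number of series terms that meets the relative tolerance."""
    exact = math.exp(x)
    for n in range(1, max_terms + 1):
        if abs(taylor_exp(x, n) - exact) <= rel_tol * exact:
            return n
    return None


# Hypothetical math table: exp(n) for the integer parts we expect to see.
EXP_TABLE = {n: math.exp(n) for n in range(-16, 17)}


def split_exp(x, terms=10):
    """exp(x) = exp(n) * exp(f), n from the table, f in [0, 1) by a short series."""
    n = math.floor(x)
    f = x - n
    return EXP_TABLE[n] * taylor_exp(f, terms)
```

Comparing `terms_needed(0.5)` with `terms_needed(10.0)` shows how the required order grows with the argument, while `split_exp` keeps the series length fixed at the cost of one table lookup and one multiplication.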

The Design and Implementation for Efficient C2A (효율적인 방공 지휘통제경보체계를 위한 설계 및 구현)

  • Kwon, Cheol-Hee;Hong, Dong-Ho;Lee, Dong-Yun;Lee, Jong-Soon;Kim, Young-Vin
    • Journal of the Korea Institute of Military Science and Technology / v.12 no.6 / pp.733-738 / 2009
  • In this paper, we propose a design and implementation for an efficient Command Control and Alert (C2A) system. Information fusion is required to determine the state and identity of targets using multiple sensors. The threat priority of targets processed and identified by information fusion is calculated by air-defence operation logic. The threatening targets are assigned to valid and effective weapons by a nearest-neighbor algorithm. Furthermore, the assignment result lets operators run C2A effectively by presenting symbol colors and colored assignment pairing lines. We introduce a prototype implemented with the proposed design and algorithm.
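
One way to read the assignment step is a greedy pass that handles targets in order of threat priority and gives each one the nearest free weapon that can reach it. The abstract gives no details, so the data layout, the range check, and the one-target-per-weapon rule below are assumptions for illustration only.

```python
import math


def assign_targets(targets, weapons):
    """Greedy nearest-neighbor pairing (hypothetical interpretation).

    targets : list of (name, threat_priority, (x, y))
    weapons : list of (name, effective_range, (x, y))
    Returns {target_name: weapon_name or None}.
    """
    free = list(weapons)
    pairs = {}
    # Highest threat priority is served first.
    for t_name, _priority, t_pos in sorted(targets, key=lambda t: -t[1]):
        # A weapon is valid only if the target lies within its range.
        candidates = [w for w in free if math.dist(w[2], t_pos) <= w[1]]
        if not candidates:
            pairs[t_name] = None
            continue
        best = min(candidates, key=lambda w: math.dist(w[2], t_pos))
        pairs[t_name] = best[0]
        free.remove(best)
    return pairs
```

A greedy pass is simple and fast but not globally optimal; an operator display like the one described would show the resulting pairs as colored lines.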

A Study on handling dense columns in interior point methods for linear programming (An efficient implementation of Schur complement method) (내부점 방법에서 밀집열 처리에 관한 연구 (Schur 상보법의 효율적인 구현))

  • 설동렬;도승용;박순달
    • Proceedings of the Korean Operations and Management Science Society Conference / 1998.10a / pp.67-70 / 1998
  • The computational speed of the interior point method for linear programming depends on the speed of the Cholesky factorization used to solve $A\Theta A^{T}\Delta y = b$. If the coefficient matrix $A$ has dense columns, the matrix $A\Theta A^{T}$ becomes dense, which slows the Cholesky factorization. The Schur complement method is applied to treat dense columns in many implementations but suffers from numerical instability. We study an efficient implementation of the Schur complement method and achieve improvements in computational speed and numerical stability.
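
The standard textbook form of the Schur complement treatment of dense columns can be written out as follows. The abstract does not give the paper's formulation, so the partition $A = [A_s \; A_d]$ into sparse and dense columns, the matching split of the diagonal matrix $\Theta$, and the symbols $P$ and $U$ are assumptions:

$$
\begin{aligned}
A\Theta A^{T} &= A_s\Theta_s A_s^{T} + A_d\Theta_d A_d^{T} = P + UU^{T},
\qquad P = A_s\Theta_s A_s^{T},\quad U = A_d\Theta_d^{1/2}, \\
\Delta y &= P^{-1}b - P^{-1}U\left(I + U^{T}P^{-1}U\right)^{-1}U^{T}P^{-1}b .
\end{aligned}
$$

Only the sparse matrix $P$ is factorized by Cholesky, and the small dense matrix $I + U^{T}P^{-1}U$ has one row per dense column. The instability mentioned in the abstract is usually attributed to $P$ becoming ill-conditioned once the dense columns are removed.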

A High Speed and Area Efficient FFT Algorithm and Its Hardware Implementation (고속 및 면적 효율적인 FFT 알고리즘 개발 및 하드웨어 구현)

  • 탁연지;정윤호;김재석;박현철;김동규;박준현;유봉위
    • Proceedings of the IEEK Conference / 2000.11b / pp.297-300 / 2000
  • This paper proposes a high-speed and area-efficient FFT algorithm and presents its hardware implementation. The algorithm, named "Radix-4/2", builds on a feature of the existing radix-$2^3$ algorithm. It reduces the number of non-trivial multipliers in the signal flow graph (SFG) by a ratio of 3 to 2 compared with the radix-2 or radix-4 algorithm, and it achieves twice the throughput of the radix-$2^3$ algorithm. It is shown that an FFT processor using the proposed algorithm and a 64-point MDC pipeline architecture has twice the throughput of one using the radix-$2^3$ algorithm and reduces area by 25 percent compared with the radix-4 algorithm.

A Network Partitioning Using the Concept of Connection Index: Algorithm and Implementation (연결지수의 개념을 사용한 회로망분실-알고리즘 및 실시)

  • 박진섭;박송배
    • Journal of the Korean Institute of Telematics and Electronics / v.21 no.6 / pp.94-104 / 1984
  • Based on a new concept, the connection index of a weighted graph, a new efficient heuristic algorithm of $O(v \cdot e)$ for network partitioning is presented, where v and e are the number of nodes and edges, respectively. Experimental results show that our algorithm is very efficient and yields an optimal or near-optimal solution for a number of partitioning problems tested. Some applications of the proposed algorithm are suggested, and its computer implementation is described in detail.

Low Latency Systolic Multiplier over $GF(2^m)$ Using Irreducible AOP (기약 AOP를 이용한 $GF(2^m)$상의 낮은 지연시간의 시스톨릭 곱셈기)

  • Kim, Kee-Won;Han, Seung-Chul
    • IEMEK Journal of Embedded Systems and Applications / v.11 no.4 / pp.227-233 / 2016
  • Efficient finite field arithmetic is essential for fast implementation of error-correcting codes and cryptographic applications. Among the arithmetic operations over finite fields, multiplication is one of the most basic, so an efficient design of a finite field multiplier is required. In this paper, two new bit-parallel systolic multipliers for $GF(2^m)$ fields defined by an AOP (all-one polynomial) are proposed. The proposed multipliers have slightly greater space complexity but save at least 22% area complexity and 13% area-time (AT) complexity compared to the existing multipliers using AOP. Compared to related works, we show that our multipliers have lower area-time complexity, cell delay, and latency. We therefore expect our multipliers to be well suited to VLSI implementation.
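
Why an AOP makes multiplication cheap can be shown with a small software sketch. It is not the paper's systolic architecture: it only illustrates the well-known property that, for the all-one polynomial $p(x) = x^m + \dots + x + 1$, $(x+1)p(x) = x^{m+1} + 1$, so $x^{m+1} \equiv 1$ and reduction becomes a cyclic shift. The function name and the integer bit encoding (bit $i$ is the coefficient of $x^i$) are assumptions.

```python
def aop_multiply(a, b, m):
    """Multiply a and b in GF(2^m) defined by the degree-m all-one polynomial.

    a, b : integers below 2**m, bit i holding the coefficient of x**i.
    The AOP must be irreducible for this to be a field (e.g. m = 4).
    """
    n = m + 1
    mask = (1 << n) - 1  # n one-bits; also the bit pattern of the AOP itself
    acc = 0
    for i in range(m):
        if (b >> i) & 1:
            # Multiplying by x**i modulo x**n + 1 is a cyclic left rotation.
            rotated = ((a << i) | (a >> (n - i))) & mask
            acc ^= rotated
    # Final step: if x**m is present, subtract (XOR) the AOP once.
    if (acc >> m) & 1:
        acc ^= mask
    return acc
```

For $m = 4$, `aop_multiply(0b0010, 0b1000, 4)` computes $x \cdot x^3 = x^4$, which reduces to $x^3 + x^2 + x + 1$, that is `0b1111`. The rotate-and-XOR structure is what hardware designs exploit, since no general polynomial division is needed.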