• Title/Abstract/Keywords: Bit-Parallel

A new scheme for VLSI implementation of fast parallel multiplier using 2x2 submultipliers and true 4:2 compressors with no carry propagation

  • 이상구;전영숙
    • 전자공학회논문지C / Vol. 34C, No. 10 / pp. 27-35 / 1997
  • In this paper, we propose a new scheme for generating partial products in a fast parallel VLSI multiplier. It adopts a new encoding method that halves the number of partial products by using 2x2 submultipliers and rearranging the primitive partial products. A true 4-input CSA can be achieved by appropriately rearranging the primitive partial products from the 2x2 submultipliers, using a newly proposed theorem on the binary number system. A 16-bit x 16-bit multiplier was designed with the proposed method and simulated to show that its speed and area are comparable to those of Booth's encoding method. A much smaller and faster multiplier could be obtained with further optimization. The proposed scheme can easily be extended to multipliers with higher-resolution inputs.

Adaptive Multi-stage Parallel Interference Cancellation Receiver for a Multi-rate DS-CDMA System

  • 한승희;이재홍
    • 대한전자공학회:학술대회논문집 / 대한전자공학회 2001년도 하계종합학술대회 논문집(1) / pp. 89-92 / 2001
  • In this paper, an adaptive multi-stage parallel interference cancellation (PIC) receiver is considered for a multi-rate DS-CDMA system. In each stage of the adaptive multi-stage PIC receiver, multiple access interference (MAI) estimates are obtained using the sub-bit estimates from the previous stage and the adaptive weights for those sub-bit estimates. The adaptive weights are obtained by minimizing the mean squared error between the received signal and its estimate through a least mean square (LMS) algorithm. It is shown that the adaptive multi-stage PIC receiver achieves a smaller BER than the matched filter receiver, the multi-stage PIC receiver, and the multi-stage partial PIC receiver for the multi-rate DS-CDMA system in a Rayleigh fading channel.

A Multicarrier CDMA System Using Divided Spreading Sequence for Time and Frequency Diversity

  • 박형근;주양익;김용석;차균현
    • 한국통신학회논문지 / Vol. 27, No. 6B / pp. 569-578 / 2002
  • This paper proposes a new multicarrier code division multiple access (CDMA) system. The proposed system offers three advantages: the transmission bandwidth is used more efficiently by means of a divided spreading sequence, time and frequency diversity is achieved in a frequency-selective multipath fading channel, and inter-carrier interference (ICI) can be minimized by using a specific data and code pattern. In this system, transmitted data bits are serial-to-parallel converted into several parallel branches. On each branch, each bit is direct-sequence spread-spectrum modulated by the divided spreading sequences and transmitted using orthogonal carriers. The receiver provides a Rake for each carrier, and the outputs of the Rakes are combined to obtain time and frequency diversity. This multicarrier CDMA system allows additional flexibility in the choice of system parameters. Bit error rate (BER) performance is examined for the proposed system under varying system parameters. Simulation results show that the proposed multicarrier CDMA scheme can achieve better performance than other types of conventional multicarrier CDMA systems.

A Study on VLSI-Oriented 2-D Systolic Array Processor Design for APP (Algebraic Path Problem)

  • 이현수;방정희
    • 전자공학회논문지B / Vol. 30B, No. 7 / pp. 1-13 / 1993
  • In this paper, the problems of conventional special-purpose array processors, such as their lack of flexibility, are investigated. A modified methodology is then proposed and applied to obtain a common solution for three typical APP algorithms, SP (Shortest Path), TC (Transitive Closure), and MST (Minimum Spanning Tree), which are solved by similar methods. The newly proposed APP parallel algorithm allows real-time processing without structural enhancement or functional restriction. In addition, we design a 2-dimensional bit-parallel lower-triangular systolic array processor and the 1-PE in detail. For its evaluation, we consider its computational complexity according to the bit-processing method and describe the relationship between total chip size and execution time. As a result, when large data sets are input in real time, the proposed processor achieves an execution time of 3n-4, which is the optimal $O(n)$ time complexity, an $O(n^{2})$ space complexity measured as the total number of gates, and a pipeline period of one.
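The reason one processor can serve SP, TC, and MST is that all three fit the same recurrence over different (add, multiply) operator pairs. The following is a minimal sequential sketch of that idea, not the paper's systolic design; the function name `algebraic_path`, the example graph, and the MST test (an edge belongs to the MST when its weight equals the minimax path weight between its endpoints, assuming distinct weights) are illustrative assumptions.

```python
INF = float("inf")

def algebraic_path(a, add, mul):
    """Floyd-Warshall-style closure over a semiring (add, mul).

    Assumes an idempotent 'add' (min, or) so no explicit star operation is needed.
    """
    n = len(a)
    d = [row[:] for row in a]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                d[i][j] = add(d[i][j], mul(d[i][k], d[k][j]))
    return d

# Hypothetical undirected example graph (INF means "no edge").
w = [
    [0,   1,   4,   INF],
    [1,   0,   2,   INF],
    [4,   2,   0,   3],
    [INF, INF, 3,   0],
]

# SP: (min, +) gives shortest path lengths.
sp = algebraic_path(w, min, lambda x, y: x + y)

# TC: (or, and) on the boolean adjacency matrix gives reachability.
adj = [[x < INF for x in row] for row in w]
tc = algebraic_path(adj, lambda x, y: x or y, lambda x, y: x and y)

# MST: (min, max) gives minimax path weights; keep edges that match them.
bottleneck = algebraic_path(w, min, max)
mst_edges = [
    (i, j)
    for i in range(len(w))
    for j in range(i + 1, len(w))
    if w[i][j] < INF and w[i][j] == bottleneck[i][j]
]
```

Only the operator pair changes between the three calls, which is what makes a single array structure plausible for all of them.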

Accelerating Soft-Decision Reed-Muller Decoding Using a Graphics Processing Unit

  • Uddin, Md. Sharif;Kim, Cheol Hong;Kim, Jong-Myon
    • 예술인문사회 융합 멀티미디어 논문지 / Vol. 4, No. 2 / pp. 369-378 / 2014
  • The Reed-Muller code is an efficient code for multiple-bit error correction; however, the high computational requirement inherent in its decoding process prohibits its use in practical applications. To solve this problem, this paper proposes a graphics processing unit (GPU)-based parallel error control approach using Reed-Muller R(r, m) coding for real-time wireless communication systems. The GPU offers a high-throughput parallel computing platform that can achieve the desired high-performance decoding by exploiting the massive parallelism inherent in the algorithm. In addition, we compare the performance of the GPU-based approach with that of an equivalent sequential approach running on a traditional CPU. The experimental results indicate that the proposed GPU-based approach substantially outperforms the sequential approach in terms of execution time, yielding over 70× speedup.

The Algorithm Design and Implementation for Operation using a Matrix Table in the WAVE system

  • 이대식;유영모;이상윤;장청룡
    • 한국통신학회논문지 / Vol. 37, No. 4A / pp. 189-196 / 2012
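  • The WAVE (Wireless Access for Vehicular Environment) system is a vehicular communication technology used to provide services that prevent accidents that may occur while driving, as well as services for vehicle function management and system fault monitoring. However, the scrambler bit operation of the WAVE system cannot be processed in parallel, which lowers the efficiency of software and hardware designs. This paper proposes an algorithm that builds a matrix table from the scrambler's bit operation process and an algorithm that operates on the input data and the matrix table in parallel. The processing speed of the proposed scrambler algorithm differs depending on whether the input unit is 8, 16, 32, or 64 bits, but because parallel processing is possible for each input unit, it further improves the processing speed of the WAVE system.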
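One way to picture the table-driven idea is the sketch below, which precomputes, for every scrambler state, the next word of scrambling sequence and the following state, so a whole input word is scrambled with one lookup and one XOR. It assumes the IEEE 802.11 OFDM PHY scrambler polynomial x^7 + x^4 + 1 (the abstract does not state the polynomial), and the table layout and function names are hypothetical rather than the paper's.

```python
def serial_scramble(bits, state):
    """Bit-serial reference. 'state' is 7 bits: bit 6 is the x^7 tap, bit 3 the x^4 tap."""
    out = []
    for b in bits:
        fb = ((state >> 6) ^ (state >> 3)) & 1   # assumed polynomial x^7 + x^4 + 1
        state = ((state << 1) | fb) & 0x7F
        out.append(b ^ fb)
    return out, state

def build_table(width=8):
    """For each of the 128 states, store (keystream word, next state) for 'width' bits."""
    table = []
    for state in range(128):
        ks, s = 0, state
        for _ in range(width):
            fb = ((s >> 6) ^ (s >> 3)) & 1
            s = ((s << 1) | fb) & 0x7F
            ks = (ks << 1) | fb                  # first generated bit ends up as the MSB
        table.append((ks, s))
    return table

def table_scramble(words, state, table):
    """Scramble one word per step: a table lookup plus a single XOR."""
    out = []
    for w in words:
        ks, state = table[state]
        out.append(w ^ ks)
    return out, state

table8 = build_table(8)
data = [0x12, 0xA5, 0xFF, 0x00]
scrambled, end_state = table_scramble(data, 0x7F, table8)
restored, _ = table_scramble(scrambled, 0x7F, table8)   # scrambling is its own inverse
```

Calling `build_table(16)`, `build_table(32)`, or `build_table(64)` gives the wider input units mentioned in the abstract; the table always has 128 rows because the state is 7 bits.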

On a Parallel-Structured High-Speed Implementation of the Word-Based Stream Cipher

  • 이훈재;도경훈
    • 한국정보통신학회논문지 / Vol. 14, No. 4 / pp. 859-867 / 2010
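  • In this paper, we propose word-based nonlinear combining function structures for use in word-based stream ciphers, in order to speed up general bit-based nonlinear combining functions. In particular, we propose PS-WFSR, which has a word-based parallel structure, and use it to propose four types of word-based parallel nonlinear combining functions that speed up their bit-based counterparts: an m-parallel word-based memoryless nonlinear combining function, an m-parallel word-based nonlinear combining function with memory, an m-parallel word-based nonlinear filter function, and an m-parallel word-based clock-controlled function. Finally, we analyze the performance through the parallel structure of m-parallel word-based DRAGON.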

Bit Operation Optimization and DNN Application using GPU Acceleration

  • 김상혁;이재흥
    • 전기전자학회논문지 / Vol. 23, No. 4 / pp. 1314-1320 / 2019
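  • This paper proposes a method for optimizing bit operations in a software environment and applying them to DNNs. To this end, we propose a packing function for bit operation optimization and a masking matrix multiplication for the DNN application. The packing function converts 32-bit real-valued weights into 2-bit values by means of threshold comparisons. After this operation, four 32-bit values fit into one 8-bit memory location. The masking matrix multiplication consists of special operations for multiplying the packed weights by ordinary input values. Each operation is processed in parallel using a GPU accelerator. As a result, on the HandWritten dataset, a memory saving of about 16 times was observed compared with the 32-bit DNN model. Nevertheless, the accuracy was similar to that of the 32-bit model, differing by less than 1%.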
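The packing step can be illustrated with a small CPU-side sketch. The 2-bit code assignment (00 for zero, 01 for +1, 10 for -1), the threshold value, and the function names are assumptions made for illustration; the abstract does not give the paper's actual encoding or its GPU kernels.

```python
def quantize(w, t=0.05):
    """Map a real weight to a 2-bit code by threshold comparison (assumed ternary code)."""
    if w > t:
        return 0b01    # stands for +1
    if w < -t:
        return 0b10    # stands for -1
    return 0b00        # stands for 0

def pack(weights, t=0.05):
    """Pack four 2-bit codes into each byte, lowest-index weight in the lowest bits."""
    if len(weights) % 4 != 0:
        raise ValueError("weight count must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(weights), 4):
        byte = 0
        for j, w in enumerate(weights[i:i + 4]):
            byte |= quantize(w, t) << (2 * j)
        out.append(byte)
    return bytes(out)

def masked_dot(packed, x):
    """Dot product of packed ternary weights with ordinary inputs, using masks only."""
    acc = 0.0
    for k, xv in enumerate(x):
        code = (packed[k // 4] >> (2 * (k % 4))) & 0b11
        if code == 0b01:
            acc += xv
        elif code == 0b10:
            acc -= xv
    return acc

packed = pack([0.3, -0.2, 0.01, 0.9])
result = masked_dot(packed, [1.0, 2.0, 3.0, 4.0])
```

Four 32-bit weights (16 bytes) end up in one byte here, which matches the roughly 16-fold memory saving reported in the abstract.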

Code Size Reduction and Execution Performance Improvement with Instruction Set Architecture Design based on Non-homogeneous Register Partition

  • 권영준;이혁재
    • 대한전기학회논문지:전력기술부문A / Vol. 48, No. 12 / pp. 1575-1579 / 1999
  • Embedded processors often accommodate two instruction sets, a standard instruction set and a compressed instruction set. With the compressed instruction set, code size can be reduced, but the instruction count (and consequently the execution time) can increase. To achieve code size reduction without a significant increase in execution time, this paper proposes a new compressed instruction set architecture called TOE (Two Operations Execution). The proposed instruction format includes a parallel bit that indicates that an instruction can be executed simultaneously with the next instruction. To make room for the parallel bit, the TOE instruction format reduces the destination register field. This reduction limits the number of registers accessible by an instruction. To overcome the limited accessibility of registers, TOE adopts a non-homogeneous register partition in which registers are divided into multiple subsets, each of which is accessed by a different group of instructions. With non-homogeneous registers, each instruction can access only a limited number of registers, but an entire program can access all available registers. With an efficient non-homogeneous register allocator, all registers can be used in a balanced manner. As a result, the increase in code size due to register spills is negligible. Experimental results show that more than 30% of TOE instructions can be executed in parallel without a significant increase in code size compared with the existing Thumb instruction set.

Efficient Parallel Block-layered Nonbinary Quasi-cyclic Low-density Parity-check Decoding on a GPU

  • Thi, Huyen Pham;Lee, Hanho
    • IEIE Transactions on Smart Processing and Computing / Vol. 6, No. 3 / pp. 210-219 / 2017
  • This paper proposes a modified min-max algorithm (MMMA) for nonbinary quasi-cyclic low-density parity-check (NB-QC-LDPC) codes and an efficient parallel block-layered decoder architecture corresponding to the algorithm on a graphics processing unit (GPU) platform. The algorithm removes multiplications over the Galois field (GF) in the merger step to reduce decoding latency without any performance loss. The decoding implementation on a GPU for NB-QC-LDPC codes achieves improvements in both flexibility and scalability. To perform the decoding on the GPU, data and memory structures suitable for parallel computing are designed. The implementation results for NB-QC-LDPC codes over GF(32) and GF(64) demonstrate that the parallel block-layered decoding on a GPU accelerates the decoding process to provide a faster decoding runtime, and obtains a higher coding gain under a low $10^{-10}$ bit error rate and low $10^{-7}$ frame error rate, compared to existing methods.