• Title/Summary/Keyword: pipelined parallel architecture

Search Result 38, Processing Time 0.022 seconds

A Development of JPEG-LS Platform for Mirco Display Environment in AR/VR Device. (AR/VR 마이크로 디스플레이 환경을 고려한 JPEG-LS 플랫폼 개발)

  • Park, Hyun-Moon;Jang, Young-Jong;Kim, Byung-Soo;Hwang, Tae-Ho
    • The Journal of the Korea institute of electronic communication sciences
    • /
    • v.14 no.2
    • /
    • pp.417-424
    • /
    • 2019
  • This paper presents the design of a JPEG-LS codec for lossless image compression from AR/VR device. The proposed JPEG-LS(: LosSless) codec is mainly composed of a context modeling block, a context update block, a pixel prediction block, a prediction error coding block, a data packetizer block, and a memory block. All operations are organized in a fully pipelined architecture for real time image processing and the LOCO-I compression algorithm using improved 2D approach to compliant with the SBT coding. Compared with a similar study in JPEG-LS, the Block-RAM size of proposed STB-FLC architecture is reduced to 1/3 compact and the parallel design of the predication block could improved the processing speed.

40Gb/s Foward Error Correction Architecture for Optical Communication System (광통신 시스템을 위한 40Gb/s Forward Error Correction 구조 설계)

  • Lee, Seung-Beom;Lee, Han-Ho
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.45 no.2
    • /
    • pp.101-111
    • /
    • 2008
  • This paper introduces a high-speed Reed-Solomon(RS) decoder, which reduces the hardware complexity, and presents an RS decoder based FEC architecture which is used for 40Gb/s optical communication systems. We introduce new pipelined degree computationless modified Euclidean(pDCME) algorithm architecture, which has high throughput and low hardware complexity. The proposed 16 channel RS FEC architecture has two 8 channel RS FEC architectures, which has 8 syndrome computation block and shared single KES block. It can reduce the hardware complexity about 30% compared to the conventional 16 channel 3-parallel FEC architecture, which is 4 syndrome computation block and shared single KES block. The proposed RS FEC architecture has been designed and implemented with the $0.18-{\mu}m$ CMOS technology in a supply voltage of 1.8 V. The result show that total number of gate is 250K and it has a data processing rate of 5.1Gb/s at a clock frequency of 400MHz. The proposed area-efficient architecture can be readily applied to the next generation FEC devices for high-speed optical communications as well as wireless communications.

Hardware Design of High Performance HEVC Deblocking Filter for UHD Videos (UHD 영상을 위한 고성능 HEVC 디블록킹 필터 설계)

  • Park, Jaeha;Ryoo, Kwangki
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.19 no.1
    • /
    • pp.178-184
    • /
    • 2015
  • This paper proposes a hardware architecture for high performance Deblocking filter(DBF) in High Efficiency Video Coding for UHD(Ultra High Definition) videos. This proposed hardware architecture which has less processing time has a 4-stage pipelined architecture with two filters and parallel boundary strength module. Also, the proposed filter can be used in low-voltage design by using clock gating architecture in 4-stage pipeline. The segmented memory architecture solves the hazard issue that arises when single port SRAM is accessed. The proposed order of filtering shortens the delay time that arises when storing data into the single port SRAM at the pre-processing stage. The DBF hardware proposed in this paper was designed with Verilog HDL, and was implemented with 22k logic gates as a result of synthesis using TSMC 0.18um CMOS standard cell library. Furthermore, the dynamic frequency can process UHD 8k($7680{\times}4320$) samples@60fps using a frequency of 150MHz with an 8K resolution and maximum dynamic frequency is 285MHz. Result from analysis shows that the proposed DBF hardware architecture operation cycle for one process coding unit has improved by 32% over the previous one.

A New Hardware Design for Generating Digital Holographic Video based on Natural Scene (실사기반 디지털 홀로그래픽 비디오의 실시간 생성을 위한 하드웨어의 설계)

  • Lee, Yoon-Hyuk;Seo, Young-Ho;Kim, Dong-Wook
    • Journal of the Institute of Electronics and Information Engineers
    • /
    • v.49 no.11
    • /
    • pp.86-94
    • /
    • 2012
  • In this paper we propose a hardware architecture of high-speed CGH (computer generated hologram) generation processor, which particularly reduces the number of memory access times to avoid the bottle-neck in the memory access operation. For this, we use three main schemes. The first is pixel-by-pixel calculation rather than light source-by-source calculation. The second is parallel calculation scheme extracted by modifying the previous recursive calculation scheme. The last one is a fully pipelined calculation scheme and exactly structured timing scheduling by adjusting the hardware. The proposed hardware is structured to calculate a row of a CGH in parallel and each hologram pixel in a row is calculated independently. It consists of input interface, initial parameter calculator, hologram pixel calculators, line buffer, and memory controller. The implemented hardware to calculate a row of a $1,920{\times}1,080$ CGH in parallel uses 168,960 LUTs, 153,944 registers, and 19,212 DSP blocks in an Altera FPGA environment. It can stably operate at 198MHz. Because of the three schemes, the time to access the external memory is reduced to about 1/20,000 of the previous ones at the same calculation speed.

(Theoretical Performance analysis of 12Mbps, r=1/2, k=7 Viterbi deocder and its implementation using FPGA for the real time performance evaluation) (12Mbps, r=1/2, k=7 비터비 디코더의 이론적 성능분석 및 실시간 성능검증을 위한 FPGA구현)

  • Jeon, Gwang-Ho;Choe, Chang-Ho;Jeong, Hae-Won;Im, Myeong-Seop
    • Journal of the Institute of Electronics Engineers of Korea SC
    • /
    • v.39 no.1
    • /
    • pp.66-75
    • /
    • 2002
  • For the theoretical performance analysis of Viterbi Decoder for wireless LAN with data rate 12Mbps, code rate 1/2 and constraint length 7 defined in IEEE 802.11a, the transfer function is derived using Cramer's rule and the first-event error probability and bit error probability is derived under the AWGN. In the design process, input symbol is quantized into 16 steps for 4 bit soft decision and register exchange method instead of memory method is proposed for trace back, which enables the majority at the final decision stage. In the implementation, the Viterbi decoder based on parallel architecture with pipelined scheme for processing 12Mbps high speed data rate and AWGN generator are implemented using FPGA chips. And then its performance is verified in real time.

Trace-Back Viterbi Decoder with Sequential State Transition Control (순서적 역방향 상태천이 제어에 의한 역추적 비터비 디코더)

  • 정차근
    • Journal of the Institute of Electronics Engineers of Korea TC
    • /
    • v.40 no.11
    • /
    • pp.51-62
    • /
    • 2003
  • This paper presents a novel survivor memeory management and decoding techniques with sequential backward state transition control in the trace back Viterbi decoder. The Viterbi algorithm is an maximum likelihood decoding scheme to estimate the likelihood of encoder state for channel error detection and correction. This scheme is applied to a broad range of digital communication such as intersymbol interference removing and channel equalization. In order to achieve the area-efficiency VLSI chip design with high throughput in the Viterbi decoder in which recursive operation is implied, more research is required to obtain a simple systematic parallel ACS architecture and surviver memory management. As a method of solution to the problem, this paper addresses a progressive decoding algorithm with sequential backward state transition control in the trace back Viterbi decoder. Compared to the conventional trace back decoding techniques, the required total memory can be greatly reduced in the proposed method. Furthermore, the proposed method can be implemented with a simple pipelined structure with systolic array type architecture. The implementation of the peripheral logic circuit for the control of memory access is not required, and memory access bandwidth can be reduced Therefore, the proposed method has characteristics of high area-efficiency and low power consumption with high throughput. Finally, the examples of decoding results for the received data with channel noise and application result are provided to evaluate the efficiency of the proposed method.

Optimized Hardware Design of Deblocking Filter for H.264/AVC (H.264/AVC를 위한 디블록킹 필터의 최적화된 하드웨어 설계)

  • Jung, Youn-Jin;Ryoo, Kwang-Ki
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.47 no.1
    • /
    • pp.20-27
    • /
    • 2010
  • This paper describes a design of 5-stage pipelined de-blocking filter with power reduction scheme and proposes a efficient memory architecture and filter order for high performance H.264/AVC Decoder. Generally the de-blocking filter removes block boundary artifacts and enhances image quality. Nevertheless filter has a few disadvantage that it requires a number of memory access and iterated operations because of filter operation for 4 time to one edge. So this paper proposes a optimized filter ordering and efficient hardware architecture for the reduction of memory access and total filter cycles. In proposed filter parallel processing is available because of structured 5-stage pipeline consisted of memory read, threshold decider, pre-calculation, filter operation and write back. Also it can reduce power consumption because it uses a clock gating scheme which disable unnecessary clock switching. Besides total number of filtering cycle is decreased by new filter order. The proposed filter is designed with Verilog-HDL and functionally verified with the whole H.264/AVC decoder using the Modelsim 6.2g simulator. Input vectors are QCIF images generated by JM9.4 standard encoder software. As a result of experiment, it shows that the filter can make about 20% total filter cycles reduction and it requires small transposition buffer size.

Performance Analysis of Implementation on Image Processing Algorithm for Multi-Access Memory System Including 16 Processing Elements (16개의 처리기를 가진 다중접근기억장치를 위한 영상처리 알고리즘의 구현에 대한 성능평가)

  • Lee, You-Jin;Kim, Jea-Hee;Park, Jong-Won
    • Journal of the Institute of Electronics Engineers of Korea CI
    • /
    • v.49 no.3
    • /
    • pp.8-14
    • /
    • 2012
  • Improving the speed of image processing is in great demand according to spread of high quality visual media or massive image applications such as 3D TV or movies, AR(Augmented reality). SIMD computer attached to a host computer can accelerate various image processing and massive data operations. MAMS is a multi-access memory system which is, along with multiple processing elements(PEs), adequate for establishing a high performance pipelined SIMD machine. MAMS supports simultaneous access to pq data elements within a horizontal, a vertical, or a block subarray with a constant interval in an arbitrary position in an $M{\times}N$ array of data elements, where the number of memory modules(MMs), m, is a prime number greater than pq. MAMS-PP4 is the first realization of the MAMS architecture, which consists of four PEs in a single chip and five MMs. This paper presents implementation of image processing algorithms and performance analysis for MAMS-PP16 which consists of 16 PEs with 17 MMs in an extension or the prior work, MAMS-PP4. The newly designed MAMS-PP16 has a 64 bit instruction format and application specific instruction set. The author develops a simulator of the MAMS-PP16 system, which implemented algorithms can be executed on. Performance analysis has done with this simulator executing implemented algorithms of processing images. The result of performance analysis verifies consistent response of MAMS-PP16 through the pyramid operation in image processing algorithms comparing with a Pentium-based serial processor. Executing the pyramid operation in MAMS-PP16 results in consistent response of processing time while randomly response time in a serial processor.