• Title/Summary/Keyword: Parallel pipeline

Search results: 172

High Performance Coprocessor Architecture for Real-Time Dense Disparity Map (실시간 Dense Disparity Map 추출을 위한 고성능 가속기 구조 설계)

  • Kim, Cheong-Ghil;Srini, Vason P.;Kim, Shin-Dug
    • The KIPS Transactions: Part A / v.14A no.5 / pp.301-308 / 2007
  • This paper proposes a high-performance coprocessor architecture for real-time dense disparity computation based on a phase-based binocular stereo matching technique called local weighted phase-correlation (LWPC). The algorithm, which consists of four stages, combines the robustness of wavelet-based phase-difference methods with the basic control strategy of phase-correlation methods. For a parallel and efficient hardware implementation, the proposed architecture employs a SIMD (Single Instruction, Multiple Data) structure for each functional stage, and all stages operate in pipelined mode. The newly devised pipelined linear array processor is optimized for row-column image processing, eliminating the need for a transposition memory while preserving generality and high throughput. The proposed architecture is implemented with Xilinx HDL tools, and the required hardware resources are reported in terms of look-up tables, flip-flops, slices, and memory. The results show that the proposed architecture can be integrated into a single chip while maintaining video-rate processing speed.
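
As a rough illustration of the phase-based idea LWPC builds on, the C sketch below estimates disparity on one scanline from the phase difference of complex band-pass (Gabor-like) responses of the left and right images. The 9-tap kernel, the centre frequency, and the single-scale setup are illustrative assumptions, not the paper's LWPC implementation.

    /* Sketch: single-scale phase-difference disparity (illustrative only). */
    #include <complex.h>
    #include <math.h>

    #define TAPS   9                      /* assumed kernel length */
    #define OMEGA0 0.7853981633974483     /* assumed centre frequency (~pi/4 rad/pixel) */

    /* Complex band-pass (Gabor-like) response of one scanline at position x. */
    static double complex bandpass(const unsigned char *row, int width, int x)
    {
        double complex acc = 0.0;
        for (int k = -TAPS / 2; k <= TAPS / 2; k++) {
            int xx = x + k;
            if (xx < 0 || xx >= width) continue;
            double w = exp(-(double)(k * k) / 8.0);       /* Gaussian window */
            acc += row[xx] * w * cexp(-I * (OMEGA0 * k)); /* modulated tap   */
        }
        return acc;
    }

    /* Disparity at x: left/right phase difference divided by the centre frequency. */
    double phase_disparity(const unsigned char *left, const unsigned char *right,
                           int width, int x)
    {
        double dphi = carg(bandpass(left, width, x) * conj(bandpass(right, width, x)));
        return dphi / OMEGA0;             /* horizontal shift in pixels */
    }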

Low-power Hardware Design of Deblocking Filter in HEVC In-loop Filter for Mobile System (모바일 시스템을 위한 저전력 HEVC 루프 내 필터의 디블록킹 필터 하드웨어 설계)

  • Park, Seungyong;Ryoo, Kwangki
    • Journal of the Korea Institute of Information and Communication Engineering / v.21 no.3 / pp.585-593 / 2017
  • In this paper, we propose a low-power deblocking filter hardware architecture for the HEVC (High Efficiency Video Coding) in-loop filter in mobile systems. HEVC compresses images on a block-by-block basis, so quantization error produces blocking artifacts, and the deblocking filter removes these artifacts. UHD video services are now supported on various mobile systems, but their power consumption is high. The proposed low-power deblocking filter minimizes power consumption by gating the clock to internal modules when the filter is not applied. It also uses four parallel filters for high throughput at low operating frequencies, and each filter is implemented as a four-stage pipeline. The proposed hardware is designed in Verilog HDL and synthesized with a TSMC 65 nm CMOS standard cell library, resulting in about 52.13K gates. It can process 8K@84fps video in real time at a 110 MHz operating frequency, with an operating power of 6.7 mW.
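
To make the on/off decision and the filtering step concrete, the C sketch below applies a simplified HEVC-style weak (normal) luma filter to one line of samples across a block edge; the early return mirrors the role of the clock-gated datapath in hardware. The threshold handling is simplified, and the code is not the paper's design.

    /* Sketch: simplified HEVC-style weak deblocking of one line p1 p0 | q0 q1. */
    #include <stdint.h>
    #include <stdlib.h>

    static int clip3(int lo, int hi, int v) { return v < lo ? lo : (v > hi ? hi : v); }

    void deblock_line(uint8_t *p1, uint8_t *p0, uint8_t *q0, uint8_t *q1,
                      int beta, int tc)
    {
        /* On/off decision (simplified): high activity across the edge means real
           image detail, so skip filtering; hardware gates the datapath clock here. */
        int d = abs(*p1 - 2 * *p0 + *q0) + abs(*q1 - 2 * *q0 + *p0);
        if (d >= beta || tc == 0)
            return;

        /* Weak-filter offset in the spirit of the HEVC normal filter. */
        int delta = (9 * (*q0 - *p0) - 3 * (*q1 - *p1) + 8) >> 4;
        delta = clip3(-tc, tc, delta);

        *p0 = (uint8_t)clip3(0, 255, *p0 + delta);
        *q0 = (uint8_t)clip3(0, 255, *q0 - delta);
    }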

An Efficient Hardware Design for Scaling and Transform Coefficients Decoding (스케일링과 변환계수 복호를 위한 효율적인 하드웨어 설계)

  • Jung, Hongkyun;Ryoo, Kwangki
    • Journal of the Korea Institute of Information and Communication Engineering / v.16 no.10 / pp.2253-2260 / 2012
  • In this paper, an efficient hardware architecture is proposed for the inverse transform and inverse quantization of an H.264/AVC decoder. Previous inverse transform and quantization architectures decode AC and DC coefficients in different orders. In the proposed architecture, inverse quantization (IQ) is performed after the inverse transform (IT) regardless of whether the coefficients are DC or AC. A common operation unit is also proposed to reduce the computational complexity of inverse quantization. Because the previous architecture includes a division operation, changing the processing order would introduce errors; to avoid this, the division is performed after the IT in the proposed architecture. The architecture is implemented as a 3-stage pipeline, and the vertical and horizontal IDCT are performed in parallel to reduce the number of operation cycles. An analysis of the cycles required per macroblock shows that the proposed ITIQ architecture improves on the previous one by 45%.
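
For reference, the C sketch below shows one 1-D pass of the well-known H.264/AVC 4x4 inverse core transform; applying it to the four columns and then the four rows (plus the final rounding shift, omitted here) yields the 2-D inverse transform, and a hardware design can pipeline the two passes so they operate on different data in parallel. It illustrates the transform only, not the paper's combined ITIQ datapath.

    /* Sketch: one 1-D pass of the H.264/AVC 4x4 inverse core transform. */
    void itrans_1d(const int in[4], int out[4])
    {
        int e0 = in[0] + in[2];
        int e1 = in[0] - in[2];
        int e2 = (in[1] >> 1) - in[3];
        int e3 = in[1] + (in[3] >> 1);

        out[0] = e0 + e3;   /* run over four columns, then four rows, */
        out[1] = e1 + e2;   /* and apply the final rounding shift     */
        out[2] = e1 - e2;   /* to obtain the 2-D inverse transform    */
        out[3] = e0 - e3;
    }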

Design of Asynchronous 16-Bit Divider Using NST Algorithm (NST알고리즘을 이용한 비동기식 16비트 제산기 설계)

  • 이우석;박석재;최호용
    • Journal of the Institute of Electronics Engineers of Korea SD / v.40 no.3 / pp.33-42 / 2003
  • This paper describes an efficient design of an asynchronous 16-bit divider using the NST (new Svoboda-Tung) algorithm. The divider reduces power consumption by using an asynchronous design scheme in which the division operation is performed only when requested. The divider consists of three blocks in an asynchronous pipeline structure: a pre-scale block, an iteration-step block, and an on-the-fly converter block. The pre-scale block is designed with a new subtracter for small area and high performance. The iteration-step block consists of an asynchronous ring structure with four division steps to reduce area. To reduce hardware overhead, the part of the ring on the critical path is designed with dual-rail circuits and the remainder with single-rail circuits. The on-the-fly converter block uses the on-the-fly algorithm, which enables operation in parallel with the iteration-step block. Design results in a 0.6 μm CMOS process show that the divider consists of 12,956 transistors in a 1,480 × 1,200 μm² area with an average-case delay of 41.7 ns.
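
The NST recurrence itself is not reproduced here; instead, the C sketch below shows a plain radix-2 digit-recurrence division for 16-bit unsigned operands, to illustrate the one-quotient-digit-per-step structure that the paper's asynchronous ring unrolls four steps at a time.

    /* Sketch: generic radix-2 digit-recurrence division (NOT the NST algorithm).
       Assumes d != 0. One quotient bit is produced per iteration. */
    #include <stdint.h>

    void div16(uint16_t n, uint16_t d, uint16_t *q, uint16_t *r)
    {
        uint32_t rem = 0;
        uint16_t quo = 0;
        for (int i = 15; i >= 0; i--) {
            rem = (rem << 1) | ((n >> i) & 1u);  /* shift in next dividend bit */
            if (rem >= d) {                      /* trial subtraction          */
                rem -= d;
                quo |= (uint16_t)(1u << i);      /* quotient digit = 1         */
            }
        }
        *q = quo;
        *r = (uint16_t)rem;
    }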

ASIC Design of OpenRISC-based Multimedia SoC Platform (OpenRISC 기반 멀티미디어 SoC 플랫폼의 ASIC 설계)

  • Kim, Sun-Chul;Ryoo, Kwang-Ki
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2008.10a / pp.281-284 / 2008
  • This paper describes the ASIC design of a multimedia SoC platform. The implemented platform consists of a 32-bit OpenRISC1200 microprocessor, a WISHBONE on-chip bus, a VGA controller, a debug interface, an SRAM interface, and a UART. The 32-bit OpenRISC1200 processor has a 5-stage pipeline and a Harvard architecture with separate instruction/data buses. The VGA controller can display RGB data on a CRT or LCD monitor. The debug interface supports debugging of the platform. The SRAM interface supports an 18-bit address bus and a 32-bit data bus. The UART implements the RS-232 protocol for serial communication. The platform is designed and verified on a Xilinx Virtex-4 XC4VLX80 FPGA board. Test code is generated with a cross compiler, and JTAG utility software and gdb are used to download the test code to the FPGA board through a parallel cable. Finally, the platform is implemented as a single ASIC chip in a Chartered 0.18 um process and can operate at a 100 MHz clock frequency.
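
As a hypothetical illustration of how bare-metal test code on such a platform might use the UART, the C sketch below polls a memory-mapped status register before writing each character. The base address, register offsets, and status bit are placeholders, not the platform's actual memory map.

    /* Sketch: polled UART output; all addresses and bits are hypothetical. */
    #include <stdint.h>

    #define UART_BASE   0x90000000u                          /* placeholder base address   */
    #define UART_TXDATA (*(volatile uint8_t *)(UART_BASE + 0x0))
    #define UART_STATUS (*(volatile uint8_t *)(UART_BASE + 0x5))
    #define TX_READY    0x20u                                /* placeholder "TX empty" bit */

    static void uart_putc(char c)
    {
        while (!(UART_STATUS & TX_READY))   /* wait until the transmitter is free */
            ;
        UART_TXDATA = (uint8_t)c;
    }

    void uart_puts(const char *s)
    {
        while (*s)
            uart_putc(*s++);
    }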


A New Architecture of High-Performance Digital Hologram Generator based on Independent Calculation of a Holographic Pixel (독립적 홀로그램 화소 연산 방식의 고성능 디지털 홀로그램 생성기의 하드웨어 구조)

  • Lee, Yoon-Huyk;Seo, Young-Ho;Choi, Hyun-Jun;Kim, Dong-Wook
    • Journal of Broadcast Engineering / v.16 no.3 / pp.403-415 / 2011
  • In this paper, we propose a hardware architecture that generates digital holograms at high speed. It uses a modified computer-generated hologram (CGH) algorithm and a pipeline-based hardware organization that removes the memory bottleneck. Instead of generating a hologram by accumulating intermediate holograms, each pixel of the final hologram is computed independently, using a CGH algorithm suited to this approach. Based on this algorithm, we propose a digital hologram generator consisting of an input interface part, a calculating part, and a normalizing part. The hardware reduces memory usage because it reuses the object light sources stored in an internal buffer, and it is parallelized by vertically adding unit cells. It can generate 86 frames of HD digital holograms per second for 1K light sources.
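
To show why each hologram pixel can be computed independently, the C sketch below accumulates the contribution of every object light source to a single pixel using a Fresnel-style point-source CGH term; since the source list is read-only, pixels can be computed fully in parallel. The phase term and constants are illustrative assumptions, not the paper's modified CGH algorithm.

    /* Sketch: one hologram pixel computed independently from all light sources. */
    #include <math.h>

    typedef struct { double x, y, z, amp; } PointSrc;

    double cgh_pixel(double px, double py, const PointSrc *src, int n,
                     double wavelength, double pitch)
    {
        const double PI = 3.14159265358979;
        double sum = 0.0;
        for (int j = 0; j < n; j++) {
            double dx = px * pitch - src[j].x;
            double dy = py * pitch - src[j].y;
            /* Fresnel-style approximation of the optical path difference. */
            double phase = PI * (dx * dx + dy * dy) / (wavelength * src[j].z);
            sum += src[j].amp * cos(phase);
        }
        return sum;   /* normalization is handled by a separate pass, as in the paper's design */
    }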

Low-Latency Programmable Look-Up Table Routing Engine for Parallel Computers (병렬 컴퓨터를 위한 저지연 프로그램형 조견표 경로지정 엔진)

  • Chang, Nae-Hyuck
    • Journal of KIISE: Computing Practices and Letters / v.6 no.2 / pp.244-253 / 2000
  • Since no single routing-switching combination performs best across different types of applications, a flexible network is required to support a range of policies. This paper introduces an implementation of a look-up table routing engine that offers flexible routing and switching policies without the performance degradation of microprocessor-based designs. By changing the contents of the look-up tables, the engine can implement wormhole routing, virtual cut-through routing, and packet switching, as well as hybrid switching, under a variety of routing algorithms. Since the routing engine has a pipelined look-up table architecture, the routing delay is as small as one flit, and it can overlap multiple routing actions without performance degradation compared with hardwired routers dedicated to a specific policy. Because the four pipeline stages do not induce hazards, expensive forwarding logic is not required. The routing engine can accommodate four physical links with a time-shared cut-through bus, or a single link with a crossbar switch. It is implemented on a Xilinx 4000-series FPGA.
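
As a sketch of table-driven routing, the C fragment below makes the forwarding and switching decision purely from a software-filled look-up table, so reprogramming the table, not the logic, changes the policy. The table layout and field names are assumptions, not the engine's actual format.

    /* Sketch: routing/switching decision taken entirely from a programmable table. */
    #include <stdint.h>

    enum sw_mode { WORMHOLE, VIRTUAL_CUT_THROUGH, PACKET_SWITCHING };

    typedef struct {
        uint8_t out_port;   /* physical link to forward the packet on   */
        uint8_t sw_mode;    /* enum sw_mode: how the flits are buffered */
    } route_entry;

    #define PORTS 4
    #define NODES 256
    static route_entry lut[PORTS][NODES];    /* filled in by routing software  */

    route_entry route_lookup(uint8_t in_port, uint8_t dest)
    {
        return lut[in_port][dest];           /* one table read per header flit */
    }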


A Vectorization Technique at Object Code Level (목적 코드 레벨에서의 벡터화 기법)

  • Lee, Dong-Ho;Kim, Ki-Chang
    • The Transactions of the Korea Information Processing Society / v.5 no.5 / pp.1172-1184 / 1998
  • ILP (Instruction Level Parallelism) processors use code-reordering algorithms to expose parallelism in a given sequential program. When applied to a loop, such an algorithm produces a software-pipelined loop, in which each iteration contains a sequence of parallel instructions composed of data-independent instructions collected from several iterations. For vector loops, however, the software pipelining technique cannot expose the maximum parallelism because it schedules the program based only on data dependencies. This paper proposes scheduling vector loops differently. We develop an algorithm to detect vector loops at the object code level and suggest a new vector scheduling algorithm for them. Our vector scheduling improves performance because it schedules based not only on data dependencies but also on loop structure and iteration conditions at the object code level. We compare the resulting schedules with those produced by software pipelining in terms of performance.
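
To make software pipelining concrete, the C sketch below writes out the prologue/kernel/epilogue form of a trivial loop: each kernel iteration pairs the load of iteration i+1 with the multiply of iteration i, so the two data-independent operations can issue together on an ILP machine. A real scheduler performs this on object code, not C source.

    /* Sketch: a software-pipelined version of c[i] = a[i] * k, written in C. */
    void scale(const int *a, int *c, int k, int n)
    {
        if (n <= 0) return;

        int x = a[0];                  /* prologue: load for iteration 0           */
        for (int i = 0; i < n - 1; i++) {
            int next = a[i + 1];       /* load for iteration i+1   \  independent, */
            c[i] = x * k;              /* multiply for iteration i /  can overlap  */
            x = next;
        }
        c[n - 1] = x * k;              /* epilogue: last multiply                  */
    }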


Analysis of GPU Performance and Memory Efficiency according to Task Processing Units (작업 처리 단위 변화에 따른 GPU 성능과 메모리 접근 시간의 관계 분석)

  • Son, Dong Oh;Sim, Gyu Yeon;Kim, Cheol Hong
    • Smart Media Journal / v.4 no.4 / pp.56-63 / 2015
  • Modern GPUs can perform massively parallel computation by exploiting many GPU cores. The GPGPU approach, which leverages the GPU's abundant computational resources, executes general-purpose applications as well as graphics applications effectively. In this paper, we investigate how memory efficiency and performance vary with the number of CTAs (Cooperative Thread Arrays) per SM (Streaming Multiprocessor), since an analysis of this relationship can guide researchers working to improve GPU performance. Our simulation results show that most benchmarks improve in performance as the number of CTAs per SM increases. Some benchmarks, however, show no improvement, either because the kernel generates only a few CTAs or because not enough CTAs can execute simultaneously. To classify the performance behavior precisely, we also analyze the relationship between performance and memory stalls, DRAM stalls due to interconnect congestion, and pipeline stalls at the memory stage. We expect these results to help studies aimed at improving parallelism and memory efficiency in GPGPU architectures.
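
As a back-of-the-envelope illustration of what bounds the number of CTAs resident on one SM, the C sketch below takes the minimum of the thread, register, and shared-memory limits; the limit values are assumed figures, not those of the GPU simulated in the paper.

    /* Sketch: resident CTAs per SM bounded by thread, register, and shared-memory
       limits. The limit values are assumed, not the simulated GPU's. */
    #include <stdio.h>

    static int min3(int a, int b, int c)
    {
        int m = a < b ? a : b;
        return m < c ? m : c;
    }

    int ctas_per_sm(int threads_per_cta, int regs_per_thread, int smem_per_cta)
    {
        const int MAX_THREADS = 1536;    /* assumed per-SM thread limit       */
        const int MAX_REGS    = 32768;   /* assumed per-SM register file size */
        const int MAX_SMEM    = 49152;   /* assumed per-SM shared memory (B)  */

        int by_threads = MAX_THREADS / threads_per_cta;
        int by_regs    = MAX_REGS / (regs_per_thread * threads_per_cta);
        int by_smem    = smem_per_cta ? MAX_SMEM / smem_per_cta : by_threads;

        return min3(by_threads, by_regs, by_smem);
    }

    int main(void)
    {
        /* e.g. 256-thread CTAs, 20 registers per thread, 4 KB shared memory each */
        printf("resident CTAs per SM: %d\n", ctas_per_sm(256, 20, 4096));
        return 0;
    }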

A Study on the Instruction Set Architecture of Multimedia Extension Processor (멀티미디어 확장 프로세서의 명령어 집합 구조에 관한 연구)

  • O, Myeong-Hun;Lee, Dong-Ik;Park, Seong-Mo
    • Journal of the Institute of Electronics Engineers of Korea SD / v.38 no.6 / pp.420-435 / 2001
  • As multimedia technology has grown rapidly in recent years, many studies have explored processing multimedia data efficiently on general-purpose processors. In this paper, we propose multimedia instructions that can process multimedia data effectively, and we suggest a processor architecture for those instructions. The processor is described in Verilog-HDL at the behavioral level and simulated with Cadence tools. The proposed extension comprises 48 instructions, which can be classified into 7 groups. Multimedia data use a 64-bit format and are processed as parallel subwords: eight 8-bit bytes, four 16-bit half-words, or two 32-bit words. The processor model is based on the integer unit of SPARC V9 and has a five-stage pipelined RISC architecture following the Harvard principle.
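
As an illustration of parallel-subword arithmetic, the C sketch below adds eight packed 8-bit lanes of a 64-bit word using only ordinary 64-bit operations, preventing carries from crossing lane boundaries; a subword instruction of the kind proposed would do this in a single hardware operation. The code is a generic SWAR trick, not the paper's instruction set.

    /* Sketch: SWAR addition of eight packed 8-bit lanes in a 64-bit word. */
    #include <stdint.h>

    uint64_t add_bytes(uint64_t x, uint64_t y)
    {
        const uint64_t H = 0x8080808080808080ull;   /* MSB of every byte lane */
        /* Add the low 7 bits of each lane, then patch the lane MSBs separately,
           so a carry never propagates from one byte lane into the next. */
        return ((x & ~H) + (y & ~H)) ^ ((x ^ y) & H);
    }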
