• Title/Summary/Keyword: Parallel pipeline

Search Result 172, Processing Time 0.026 seconds

A Software Method for Improving the Performance of Real-time Rendering of 3D Games (3D 게임의 실시간 렌더링 속도 향상을 위한 소프트웨어적 기법)

  • Whang, Suk-Min;Sung, Mee-Young;You, Yong-Hee;Kim, Nam-Joong
    • Journal of Korea Game Society
    • /
    • v.6 no.4
    • /
    • pp.55-61
    • /
    • 2006
  • Graphics rendering pipeline (application, geometry, and rasterizer) is the core of real-time graphics which is the most important functionality for computer games. Usually this rendering process is completed by both the CPU and the GPU, and a bottleneck can be located either in the CPU or the GPU. This paper focuses on reducing the bottleneck between the CPU and the GPU. We are proposing a method for improving the performance of parallel processing for real-time graphics rendering by separating the CPU operations (usually performed using a thread) into two parts: pure CPU operations and operations related to the GPU, and let them operate in parallel. This allows for maximizing the parallelism in processing the communication between the CPU and the GPU. Some experiments lead us to confirm that our method proposed in this paper can allow for faster graphics rendering. In addition to our method of using a dedicated thread for GPU related operations, we are also proposing an algorithm for balancing the graphics pipeline using the idle time due to the bottleneck. We have implemented the two methods proposed in this paper in our networked 3D game engine and verified that our methods are effective in real systems.

  • PDF

Parallel Computation of FDTD algorithm using CUDA (CUDA를 이용한 FDTD 알고리즘의 병렬처리)

  • Lee, Ho-Young;Park, Jong-Hyun;Kim, Jun-Seong
    • Journal of the Institute of Electronics Engineers of Korea CI
    • /
    • v.47 no.4
    • /
    • pp.82-87
    • /
    • 2010
  • Modern GPUs(Graphic Processing Units) provide computing capability higher than that of the general CPUs(Central Processor Units). With supports of programmability of graphics pipeline GP-GPU(General Purpose computation on GPU) has gained much attention expanding its application area. This paper compares sequential and massively parallel implementations of FDTD(Finite Difference Time Domain) algorithm using CUDA(Compute Unified Device Architecture). Experimental results show upto 45X speedup over conventional CPU execution.

Parallel Pipeline Architecture of H.264 Decoder and U-Chip Based on Parallel Array (병렬 어레이 프로세서 기반 U-Chip 및 H.264 디코더의 병렬 파이프라인 구조)

  • Suk, Jung-Hee;Lyuh, Chun-Gi;Roh, Tae Moon
    • Proceedings of the Korean Society of Broadcast Engineers Conference
    • /
    • 2013.11a
    • /
    • pp.161-164
    • /
    • 2013
  • 본 논문에서는 다양한 멀티미디어 코덱을 고속으로 처리하기 위하여 전용하드웨어가 아닌 병렬 어레이 프로세서 기반의 U-Chip(Universal-Chip) 구조를 제안하고 TSMC 80nm 공정을 사용하여 11,865,090개의 게이트 수를 가지는 칩으로 개발하였다. U-Chip은 역양자화(IQ), 역변환(IT), 움직임 보상(MC) 연산을 위한 $4{\times}16$ 개의 프로세싱 유닛으로 구성된 병렬 어레이 프로세서와 문맥적응적 가변길이디코딩(CAVLC)을 위한 비트스트림 프로세서와 인트라 예측(IP), 디블록킹필터(DF) 연산을 위한 순차 프로세서와 DMAC의 데이터 전송 및 각 프로세서를 제어하여 병렬 파이프라인 스케쥴링을 처리하는 시퀀서 프로세서 등으로 구성된다. 1개의 프로세싱 유닛에 1개의 매크로블록 데이터를 맵핑하여 총 64개의 매크로블록을 병렬처리 하였다. 64개 매크로블록의 대용량 데이터 전송 시간과 각 프로세서들의 연산을 동시에 병렬 파이프라인 함으로서 전체 연산 성능을 높일 수 있는 이점이 있다. 병렬 파이프라인 구조의 H.264 디코더 프로그램을 개발하였고 제작된 U-Chip을 통해 $720{\times}480$ 크기의 베이스라인 프로파일 영상에 대하여 코어 192MHz 동작, DDR 메모리 96MHz 동작에서 30fps의 처리율을 가짐을 확인하였다.

  • PDF

A 32-bit Microprocessor with enhanced digital signal process functionality (디지털 신호처리 기능을 강화한 32비트 마이크로프로세서)

  • Moon, Sang-ook
    • Proceedings of the Korean Institute of Information and Commucation Sciences Conference
    • /
    • v.9 no.2
    • /
    • pp.820-822
    • /
    • 2005
  • We have designed a 32-bit microprocessor with fixed point digital signal processing functionality. This processor, combines both general-purpose microprocessor and digital signal processor functionality using the reduced instruction set computer design principles. It has functional units for arithmetic operation, digital signal processing and memory access. They operate in parallel in order to remove stall cycles after DSP or load/store instructions, which usually need one or more issue latency cycles in addition to the first issue cycle. High performance was achieved with these parallel functional units while adopting a sophisticated five-stage pipeline stucture.

  • PDF

A New Algorithm and High-Performance Hardware Design for 2-Dimensional Parallel Generation of Digital Hologram (디지털 홀로그램의 2차원적인 병렬 생성을 위한 알고리즘 및 고성능 하드웨어 설계)

  • Yang, Wol-Sung;Seo, Young-Ho;Kim, Dong-Wook
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.16 no.1
    • /
    • pp.133-142
    • /
    • 2012
  • In this paper, we propose and implement a high-speed algorithm for CGH that is to calculate digital hologram by modeling the interference phenomenon for tow lights. This algorithm changes the computation equations into a parallel-computable ones and implements it with a structure consisting of two kinds of cells (initial calculation cell, and update calculation cell). The parallel computation algorithm is to get the rest hologram pixels concurrently after calculation the first hologram column. Here, the initial calculation cells compute the first column of the hologram and the update calculation cells compute the rest of the hologram. The two kinds of cells performs a pipeline operation to complete the operations of the two cells at the same time. A CGH calculator to compute the hole hologram for a light source is structured by arranging the two kinds of cells. Results from simulation showed that the maximum operation frequency is about 215MHz. So, experiments are performed by setting this frequency and the same environments as the method showing the best performance. As the results, the proposed one could complete the computation of 81.75 CGH frames per second, while the previous method computes 62.9 CGH frames per second.

Design of Image Extraction Hardware for Hand Gesture Vision Recognition

  • Lee, Chang-Yong;Kwon, So-Young;Kim, Young-Hyung;Lee, Yong-Hwan
    • Journal of Advanced Information Technology and Convergence
    • /
    • v.10 no.1
    • /
    • pp.71-83
    • /
    • 2020
  • In this paper, we propose a system that can detect the shape of a hand at high speed using an FPGA. The hand-shape detection system is designed using Verilog HDL, a hardware language that can process in parallel instead of sequentially running C++ because real-time processing is important. There are several methods for hand gesture recognition, but the image processing method is used. Since the human eye is sensitive to brightness, the YCbCr color model was selected among various color expression methods to obtain a result that is less affected by lighting. For the CbCr elements, only the components corresponding to the skin color are filtered out from the input image by utilizing the restriction conditions. In order to increase the speed of object recognition, a median filter that removes noise present in the input image is used, and this filter is designed to allow comparison of values and extraction of intermediate values at the same time to reduce the amount of computation. For parallel processing, it is designed to locate the centerline of the hand during scanning and sorting the stored data. The line with the highest count is selected as the center line of the hand, and the size of the hand is determined based on the count, and the hand and arm parts are separated. The designed hardware circuit satisfied the target operating frequency and the number of gates.

A Low-Power LSI Design of Japanese Word Recognition System

  • Yoshizawa, Shingo;Miyanaga, Yoshikazu;Wada, Naoya;Yoshida, Norinobu
    • Proceedings of the IEEK Conference
    • /
    • 2002.07a
    • /
    • pp.98-101
    • /
    • 2002
  • This paper reports a parallel architecture in a HMM based speech recognition system for a low-power LSI design. The proposed architecture calculates output probability of continuous HMM (CHMM) by using concurrent and pipeline processing. They enable to reduce memory access and have high computing efficiency. The novel point is the efficient use of register arrays that reduce memory access considerably compared with any conventional method. The implemented system can achieve a real time response with lower clock in a middle size vocabulary recognition task (100-1000 words) by using this technique.

  • PDF

A Study on the Design of FFT Architecture for Ultra-Wide Band OFDM Communication System (UWB OFDM 통신 시스템 용 FFT(Fast Fourier Transform) 설계에 관한 연구)

  • Park Kye-Wan;Yoon Sang-hun;Chong Jong-Wha
    • Proceedings of the IEEK Conference
    • /
    • 2004.06a
    • /
    • pp.309-312
    • /
    • 2004
  • This paper proposes the architecture of UWB OFDM communication system. More high data rate is requested in the 128-point FFT/IFFT of the UWB OFDM communication system than the conventional communication systems. So, the proposed architecture uses pipeline and parallel architecture. For a highly efficient architecture, the optimal clipping power and the input quantization bits are found in simulation. The hardware complexity of the proposed architecture is presented is consideration of Adder, Register and Complex Multiplier.

  • PDF

A Low Power Dual CDS for a Column-Parallel CMOS Image Sensor

  • Cho, Kyuik;Kim, Daeyun;Song, Minkyu
    • JSTS:Journal of Semiconductor Technology and Science
    • /
    • v.12 no.4
    • /
    • pp.388-396
    • /
    • 2012
  • In this paper, a $320{\times}240$ pixel, 80 frame/s CMOS image sensor with a low power dual correlated double sampling (CDS) scheme is presented. A novel 8-bit hold-and-go counter in each column is proposed to obtain 10-bit resolution. Furthermore, dual CDS and a configurable counter scheme are also discussed to realize efficient power reduction. With these techniques, the digital counter consumes at least 43% and at most 61% less power compared with the column-counters type, and the frame rate is approximately 40% faster than the double memory type due to a partial pipeline structure without additional memories. The prototype sensor was fabricated in a Samsung $0.13{\mu}m$ 1P4M CMOS process and used a 4T APS with a pixel pitch of $2.25{\mu}m$. The measured column fixed pattern noise (FPN) is 0.10 LSB.

Prediction of Chlorine Concentration in a Pilot-Scaled Plant Distribution System (Pilot 규모의 모의 관망에서의 염소 농도 예측)

  • Kim, Hyun Jun;Kim, Sang Hyun
    • Journal of Korean Society of Water and Wastewater
    • /
    • v.26 no.6
    • /
    • pp.861-869
    • /
    • 2012
  • The chlorine's residual concentration prevents the regrowth of microorganism in water transport along the pipeline system. Precise prediction of chlorine concentration is important in determining disinfectant injection for the water distribution system. In this study, a pilot scale water distribution system was designed and fabricated to measure the temporal variation of chlorine concentration for three flow conditions (V = 0.88, 1.33, 1.95 m/s). Various kinetic models were applied to identify the relationship between hydraulic condition and chlorine decay. Genetic Algorithm (GA) was integrated into five kinetic models and time series of chlorine were used to calibrate parameters. Model fitness was compared by Root Mean Square Error (RMSE) between measurement and prediction. Limited first order model and Parallel first order showed good fitness for prediction of chlorine concentration.