• Title/Summary/Keyword: parallel processing

VLSI architecture design of CAVLC entropy encoder/decoder for H.264/AVC (H.264/AVC를 위한 CAVLC 엔트로피 부/복호화기의 VLSI 설계)

  • Lee Dae-joon;Jeong Yong-jin
    • The Journal of Korean Institute of Communications and Information Sciences / v.30 no.5C / pp.371-381 / 2005
  • In this paper, we propose an advanced hardware architecture for a CAVLC entropy encoder/decoder engine for real-time video compression. CAVLC (Context-based Adaptive Variable Length Coding) is a lossless compression method in H.264/AVC that offers high compression efficiency at the cost of high computational complexity. The reference memory size is optimized using a partitioned storing method and a memory reuse method, both based on the locality of memory references. Among several encoder/decoder architectures, we choose the one most suitable for mobile devices and improve its performance using parallel processing. The proposed architecture has been verified on an ARM-interfaced emulation board using Altera Excalibur and synthesized in Samsung 0.18 µm CMOS technology. The synthesis results show that the encoder can process about 300 CIF frames/s at 150 MHz and the decoder about 250 CIF frames/s at 140 MHz. The hardware architectures are being used as core modules in the implementation of a complete H.264/AVC video encoder/decoder chip for real-time multimedia applications.
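As a rough illustration of the kind of per-block bookkeeping a CAVLC engine performs, the sketch below derives three standard CAVLC quantities from one zigzag-scanned coefficient block: the number of nonzero coefficients, the number of trailing ±1 coefficients (capped at three), and the number of zeros before the last nonzero coefficient. It is not the hardware design described in the abstract; the function name `cavlc_block_stats` and the sample block are assumptions made for illustration.

```python
def cavlc_block_stats(zigzag):
    """Return (total_coeff, trailing_ones, total_zeros) for one block.

    `zigzag` is a list of quantized coefficients in zigzag scan order.
    """
    nonzero_positions = [i for i, c in enumerate(zigzag) if c != 0]
    total_coeff = len(nonzero_positions)
    if total_coeff == 0:
        return 0, 0, 0

    # Count consecutive +/-1 values from the highest-frequency end,
    # stopping at the first other magnitude or after three of them.
    trailing_ones = 0
    for i in reversed(nonzero_positions):
        if abs(zigzag[i]) == 1 and trailing_ones < 3:
            trailing_ones += 1
        else:
            break

    # Zeros located before the last nonzero coefficient in scan order.
    last = nonzero_positions[-1]
    total_zeros = (last + 1) - total_coeff
    return total_coeff, trailing_ones, total_zeros


# Hypothetical 4x4 block, already zigzag-scanned.
block = [0, 3, 0, 1, -1, -1, 0, 1] + [0] * 8
stats = cavlc_block_stats(block)
```

In an encoder these values select the variable-length code tables for the block, which is where the context adaptivity comes from.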

A Study on the Design of a RISC core with DSP Support (DSP기능을 강화한 RISC 프로세서 core의 ASIC 설계 연구)

  • 김문경;정우경;이용석;이광엽
    • The Journal of Korean Institute of Communications and Information Sciences / v.26 no.11C / pp.148-156 / 2001
  • This paper proposes an embedded application-specific microprocessor (YS-RDSP) with an additional on-chip DSP processor. The YS-RDSP can execute up to four instructions in parallel. To reduce program size, it supports both 16-bit and 32-bit instruction lengths. The YS-RDSP provides programmability, controllability, and DSP processing ability, and includes eight kilobytes of on-chip ROM and eight kilobytes of RAM. The on-chip system controller provides three power-down modes for low-power operation, and the SLEEP instruction changes the operating state of the CPU core and peripherals. The YS-RDSP processor was implemented in Verilog HDL using a top-down methodology, and it was refined and verified with a cycle-based simulator written in C. The verified model was synthesized with a 0.7 µm, 3.3 V CMOS standard cell library, and the layout, produced with automatic P&R software, measured 10.7mm78.4mm.

On Designing 4-way Superscalar Digital Signal Processor Core (4-way 수퍼 스칼라 디지털 시그널 프로세서 코어 설계)

  • 김준석;유선국;박성욱;정남훈;고우석;이근섭;윤대희
    • The Journal of Korean Institute of Communications and Information Sciences / v.23 no.6 / pp.1409-1418 / 1998
  • Recent audio CODEC (Coding/Decoding) algorithms combine several coding techniques and can be divided into DSP tasks, controller tasks, and mixed tasks. Traditional DSP processors have been designed for fast processing of DSP tasks only, not for controller and mixed tasks. This paper presents a new architecture that achieves high throughput on both the controller and mixed tasks of such algorithms while maintaining high performance for DSP tasks. The proposed processor, YSP-3, operates four functional units (a multiplier, two ALUs, and a load/store unit) in parallel via a 4-issue superscalar instruction structure. The performance of YSP-3 was evaluated by implementing several DSP algorithms and part of the AC-3 decoding algorithm.

Implementation of a G.723.1 Annex A Using a High Performance DSP (고성능 DSP를 이용한 G.723.1 Annex A 구현)

  • 최용수;강태익
    • The Journal of the Acoustical Society of Korea / v.21 no.7 / pp.648-655 / 2002
  • This paper describes the implementation of a multi-channel G.723.1 Annex A (G.723.1A) codec, focusing on code optimization for a high-performance general-purpose Digital Signal Processor (DSP). To implement a multi-channel G.723.1A, the functional complexities of the ITU-T G.723.1A fixed-point C code are measured and analyzed. We then sort the C functions in order of complexity and optimize them. In parallel with the optimization, we verify the bit-exactness of the optimized code using the ITU-T test vectors. Using only internal memory, the optimized code can perform full-duplex 17-channel processing. In addition, we further increase the number of available channels per DSP to 22 using fast codebook search algorithms, referred to as bit-compatible optimization.

Three-dimensional Wave Propagation Modeling using OpenACC and GPU (OpenACC와 GPU를 이용한 3차원 파동 전파 모델링)

  • Kim, Ahreum;Lee, Jongwoo;Ha, Wansoo
    • Geophysics and Geophysical Exploration / v.20 no.2 / pp.72-77 / 2017
  • We calculated 3D frequency- and Laplace-domain wavefields using time-domain modeling followed by a Fourier or Laplace transform. We adopted OpenACC and a GPU for efficient parallel calculation. OpenACC makes it easy to use GPU accelerators by adding directives to conventional C, C++, and Fortran programs. Accordingly, one doesn't have to learn a new GPGPU programming language such as CUDA or OpenCL to use a GPU. An OpenACC program allocates GPU memory, transfers data between the host CPU and GPU devices, and performs GPU operations either automatically or following user-defined directives. Through numerical tests, we compared the performance of 3D wave propagation modeling programs using OpenACC and a GPU with that of programs using a single CPU core. Results for a homogeneous model and the SEG/EAGE salt model show that the OpenACC programs are approximately 53 and 30 times faster, respectively, than the single-core CPU programs.

The Study of Fire Driven Flow and Smoke Exhaust Efficiency for PSD Installation Subway Station (PSD 설치역사의 화재유동 및 배연 효율 연구)

  • Jang, Yong-Jun;Lee, Chang-Hyun;Kim, Hag-Beom;Kim, Jin-Ho
    • Proceedings of the KSR Conference / 2009.05a / pp.1054-1061 / 2009
  • This research focuses on fire-driven flow behavior and smoke exhaust efficiency as they depend on the presence of platform screen doors (PSDs), which are being installed both domestically and overseas. Jung-ang-ro station of the Dae-gu subway was chosen as the model, and the fire-driven flow analysis was performed using FDS as the flow analysis code. Because modeling the entire station increases the number of grid cells and therefore the calculation time, the simulation was run using parallel processing. Fire scenarios were composed and analyzed case by case to compare the changes in fire-driven flow and smoke exhaust efficiency with and without PSDs. A fire size of 10 MW was studied with reference to NFPA-130. The calculation results were analyzed with a focus on passenger safety, again with reference to NFPA-130.

Stereo Matching by Dynamic Programming with Edges Emphasized (에지 정보를 강조한 동적계획법에 의한 스테레오 정합)

  • Joo, Jae-Heum;Oh, Jong-kyu;Seol, Sung-Wook;Lee, Chul-Hun;Nam, Ki-Gon
    • Journal of the Korean Institute of Telematics and Electronics S / v.36S no.10 / pp.123-131 / 1999
  • In this paper, we propose a stereo matching algorithm based on dynamic programming with edges emphasized. Existing algorithms generally show blur at depth discontinuities, owing to the smoothness constraint and the absence of matching pixels in occlusion regions. They also suffer from matching errors caused by the lack of matching information in untextured regions. This paper defines a new cost function to address these problems. The cost function steers the matching so that edges in the left and right images are matched to edge regions, and the remaining regions are matched to each other. Because edges may be produced in large numbers, matching between edges adds a weight to the cost function in proportion to the path distance. The proposed algorithm was applied to various images obtained with a convergent camera model as well as a parallel camera model. As a result, it showed improved performance in terms of matching error and processing of occlusion regions compared with existing algorithms. It also reduced blur, especially in discontinuity regions.

A New Complex-Number Multiplication Algorithm using Radix-4 Booth Recoding and RB Arithmetic, and a 10-bit CMAC Core Design (Radix-4 Booth Recoding과 RB 연산을 이용한 새로운 복소수 승산 알고리듬 및 10-bit CMAC코어 설계)

  • 김호하;신경욱
    • Journal of the Korean Institute of Telematics and Electronics C / v.35C no.9 / pp.11-20 / 1998
  • High-speed complex-number arithmetic units are essential to baseband signal processing in modern digital communication systems, such as channel equalization, timing recovery, modulation, and demodulation. In this paper, a new complex-number multiplication algorithm is proposed, based on redundant binary (RB) arithmetic combined with a radix-4 Booth recoding scheme. The proposed algorithm reduces the number of partial products by one-half compared with the conventional direct method using real-number multipliers and adders. It also leads to a highly parallel architecture and a simplified circuit, resulting in high-speed operation and low power dissipation. To demonstrate the proposed algorithm, a prototype complex-number multiplier-accumulator (CMAC) core with 10-bit operands has been designed using 0.8-$\mu$m N-well CMOS technology. The designed CMAC core contains about 18,000 transistors in an area of about $1.60 \times 1.93\ \textrm{mm}^2$. The functional and speed test results show that it can operate with a 120-MHz clock at $V_{DD}$ = 3.3 V, and its power consumption is about 63 mW.
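To make the partial-product saving concrete, here is a minimal sketch of standard radix-4 Booth recoding applied to a signed multiplier, followed by a complex multiplication built from four such real products. It covers only the Booth recoding step; the redundant binary arithmetic and the authors' actual algorithm are not reproduced. The function names and the default 10-bit width are assumptions chosen to match the operand size mentioned in the abstract.

```python
def booth_radix4_digits(y, bits=10):
    """Recode a signed `bits`-wide integer into radix-4 Booth digits.

    Each digit lies in {-2, -1, 0, 1, 2}; digit i has weight 4**i.
    A 10-bit multiplier yields 5 digits, i.e. 5 partial products
    instead of 10.
    """
    if bits % 2 != 0:
        raise ValueError("bits must be even")
    if not -(1 << (bits - 1)) <= y < (1 << (bits - 1)):
        raise ValueError("y does not fit in the given width")
    pattern = y & ((1 << bits) - 1)  # two's complement bit pattern
    digits = []
    prev = 0  # implicit bit to the right of the LSB
    for i in range(0, bits, 2):
        b0 = (pattern >> i) & 1
        b1 = (pattern >> (i + 1)) & 1
        digits.append(-2 * b1 + b0 + prev)
        prev = b1
    return digits


def booth_multiply(x, y, bits=10):
    """Multiply x by y by summing one partial product per Booth digit."""
    return sum(d * x * 4 ** i
               for i, d in enumerate(booth_radix4_digits(y, bits)))


def complex_multiply(a, b, c, d, bits=10):
    """(a + jb)(c + jd) using four Booth-recoded real products."""
    real = booth_multiply(a, c, bits) - booth_multiply(b, d, bits)
    imag = booth_multiply(a, d, bits) + booth_multiply(b, c, bits)
    return real, imag
```

For example, `complex_multiply(3, 4, 5, -6)` evaluates (3 + 4j)(5 - 6j) and returns `(39, 2)`.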

A Hardware Implementation of Pyramidal KLT Feature Tracker (계층적 KLT 특징 추적기의 하드웨어 구현)

  • Kim, Hyun-Jin;Kim, Gyeong-Hwan
    • Journal of the Institute of Electronics Engineers of Korea SP / v.46 no.2 / pp.57-64 / 2009
  • This paper presents a hardware implementation of the pyramidal KLT (Kanade-Lucas-Tomasi) feature tracker. Because of its high computational complexity, it is not easy to implement a real-time KLT feature tracker on general-purpose processors. A hardware implementation of the pyramidal KLT feature tracker on an FPGA (Field Programmable Gate Array) is described, with emphasis on 1) adaptive adjustment of the threshold in feature extraction under diverse lighting conditions, and 2) modification of the tracking algorithm to accommodate parallel processing and to overcome memory constraints such as capacity and bandwidth limitations. The effectiveness of the implementation was evaluated against results produced by its software implementation. The throughput of the FPGA-based tracker was 30 frames/s for video images of size $720 \times 480$.

Design and Implementation of Accelerator Architecture for Binary Weight Network on FPGA with Limited Resources (한정된 자원을 갖는 FPGA에서의 이진가중치 신경망 가속처리 구조 설계 및 구현)

  • Kim, Jong-Hyun;Yun, SangKyun
    • Journal of IKEEE / v.24 no.1 / pp.225-231 / 2020
  • In this paper, we propose a method to accelerate a binary weight network (BWN) on an FPGA with limited resources for embedded systems. Because the number of available logic elements is limited, a single computing unit capable of handling Conv-layers and FC-layers of various sizes must be designed and reused. Also, if the input feature map cannot be processed in parallel all at once, the output must be calculated by reading the inputs several times. Since the number of available BRAM modules is limited, the number of data bits in the BWN accelerator must be minimized. The image classification time of the BWN accelerator is superior to that of an embedded CPU, faster than that of a desktop PC, and 50% slower than that of a GPU system. Since the BWN accelerator uses a slow clock of 50 MHz, it is advantageous in terms of performance versus power.
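The two constraints mentioned above, weights restricted to two values and inputs that cannot all be processed at once, can be sketched in a few lines. In a binary weight network each weight is +1 or -1, so a dot product reduces to additions and subtractions. The code below accumulates such a dot product in fixed-size passes. The bit encoding (1 for +1, 0 for -1), the `lanes` parameter, and the function name are assumptions for illustration and do not describe the accelerator in the paper.

```python
def bwn_dot_tiled(inputs, weight_bits, lanes=4):
    """Dot product with weights in {+1, -1}, `lanes` inputs per pass.

    weight_bits[i] == 1 means weight +1, 0 means weight -1 (assumed
    encoding). Returns (accumulated_sum, number_of_passes).
    """
    if len(inputs) != len(weight_bits):
        raise ValueError("inputs and weight_bits must have equal length")
    acc = 0
    passes = 0
    for start in range(0, len(inputs), lanes):
        chunk = zip(inputs[start:start + lanes],
                    weight_bits[start:start + lanes])
        for x, w in chunk:
            # No multiplier needed: add for +1, subtract for -1.
            acc += x if w else -x
        passes += 1
    return acc, passes


# Six inputs on a hypothetical 4-lane unit take two passes.
result = bwn_dot_tiled([1, 2, 3, 4, 5, 6], [1, 0, 1, 1, 0, 0], lanes=4)
```

Here `result` is `(-5, 2)`: the sum 1 - 2 + 3 + 4 - 5 - 6 computed over two passes.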