Search | Korea Science

Study of parallelization methods for real-time HEVC encoder implementation (실시간 HEVC 인코더 구현을 위한 병렬화 기법에 관한 연구)

Ahn, Yongjo;Hwang, Taejin;Lee, Dongkyu;Kim, Sangmin;Oh, Seoung-Jun;Sim, Dong-Gyu
- Proceedings of the Korean Society of Broadcast Engineers Conference / 2013.06a / pp.119-122 / 2013
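HEVC (High Efficiency Video Coding), which is being standardized by the JCT-VC (Joint Collaborative Team on Video Coding) formed jointly by ITU-T VCEG and ISO/IEC MPEG, achieves roughly twice the compression efficiency of H.264/AVC. However, the increase in encoder complexity caused by hierarchically structured variable-size blocks and a recursive coding structure has been identified as a problem to be addressed. This paper introduces various parallelization techniques for real-time implementation of the HEVC encoder currently under standardization, such as data-level parallelization using SIMD instructions and multi-threading on CPUs and GPUs. It also introduces the operations and functional modules that are suited to applying these techniques in an HEVC encoder. Applied to HM (HEVC reference model), the techniques achieved an encoding speed of 20-30 fps for $832{\times}480$ video and 5-10 fps for $1920{\times}1080$ video.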

Parallel 2D-DWT Hardware Architecture for Image Compression Using the Lifting Scheme (이미지 압축을 위한 Lifting Scheme을 이용한 병렬 2D-DWT 하드웨어 구조)

Kim, Jong-Woog;Chong, Jong-Wha
- Journal of IKEEE / v.6 no.1 s.10 / pp.80-86 / 2002
This paper presents a fast hardware architecture that implements a 2-D DWT (Discrete Wavelet Transform) computed within the lifting scheme framework. Conventional 2-D DWT hardware architectures have problems with internal memory, hardware resources, and latency. The proposed architecture is based on a 4-way partitioned data set and is configured as a pipelined parallel architecture for the 4-way partitioning method. With this architecture, total latency was improved by 50%, and memory size was reduced by using the lifting scheme.
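
As an illustration of the lifting scheme mentioned in the abstract above, here is a minimal sketch of one 1-D lifting level (split, predict, update) with its exact inverse. The abstract does not say which wavelet filter the authors used; the integer Haar filter and the function names `lifting_forward` and `lifting_inverse` are assumptions made only for this example. A 2-D transform would apply the same step to rows and then to columns.

```python
def lifting_forward(signal):
    """One level of integer Haar lifting (assumed filter, even-length input)."""
    even = signal[0::2]          # split: even-indexed samples
    odd = signal[1::2]           # split: odd-indexed samples
    # Predict: each odd sample is predicted by its even neighbour.
    detail = [o - e for o, e in zip(odd, even)]
    # Update: adjust the even samples so they carry the floored pair average.
    approx = [e + (d >> 1) for e, d in zip(even, detail)]
    return approx, detail


def lifting_inverse(approx, detail):
    """Undo the lifting steps in reverse order; reconstruction is exact."""
    even = [a - (d >> 1) for a, d in zip(approx, detail)]
    odd = [d + e for d, e in zip(detail, even)]
    out = []
    for e, o in zip(even, odd):
        out.extend((e, o))       # merge even and odd samples back in order
    return out
```

Because each lifting step overwrites one half of the samples using only the other half, the transform can be computed in place, which is the usual reason lifting is associated with reduced memory.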

An Efficient Parallelized Algorithm of SEED Block Cipher on Cell BE (CELL 프로세서를 이용한 SEED 블록 암호화 알고리즘의 효율적인 병렬화 기법)

Kim, Deok-Ho;Yi, Jae-Young;Ro, Won-Woo
- The KIPS Transactions:PartA / v.17A no.6 / pp.275-280 / 2010
In this paper, we propose an efficiently parallelized block cipher algorithm on the CELL BE processor. Considering the heterogeneous nature of the CELL BE architecture, we apply different encoding/decoding methods to the PPE and SPE and improve the throughput. Our implementation was fully tested, and the execution results show high throughput, capable of supporting network speeds as high as 2.59 Gbps. Compared to various parallel implementations on multi-core systems, our approach provides a speedup of 1.34 in encoding/decoding speed.
https://doi.org/10.3745/KIPSTA.2010.17A.6.275

Efficient Parallel CUDA Random Number Generator on NVIDIA GPUs (NVIDIA GPU 상에서의 난수 생성을 위한 CUDA 병렬프로그램)

Kim, Youngtae;Hwang, Gyuhyeon
- Journal of KIISE / v.42 no.12 / pp.1467-1473 / 2015
In this paper, we implemented a parallel random number generation program on GPUs, which are known for high-performance computing, using an LCG (Linear Congruential Generator). Random numbers are important in all fields that require randomness, and the LCG is one of the most widely used methods for generating pseudo-random numbers. We explain the parallel program using the NVIDIA CUDA model and MPI (Message Passing Interface) and show the uniformity of the distribution and performance results. We also used a Monte Carlo algorithm to calculate pi (${\pi}$), comparing our parallel random number generator with cuRAND, a CUDA library, and showed that our program is much more efficient. Finally, we compared performance results on multiple GPUs with ideal speedups.
https://doi.org/10.5626/JOK.2015.42.12.1467
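
One common way to parallelize an LCG is block splitting: each worker jumps ahead to its own segment of the sequence, so the workers together reproduce the sequential stream. The sketch below illustrates that idea only. The abstract does not state the authors' splitting method or constants, so the multiplier, increment, and modulus here (a widely published 32-bit parameter set) and the helper names `lcg_skip` and `stream` are assumptions.

```python
A, C, M = 1664525, 1013904223, 2**32   # assumed LCG parameters, not from the paper


def lcg_next(x):
    """Advance the generator by one step: x -> (A*x + C) mod M."""
    return (A * x + C) % M


def lcg_skip(x, n):
    """Advance state x by n steps in O(log n) by composing affine maps."""
    a_acc, c_acc = 1, 0        # accumulated map, starts as the identity
    a_cur, c_cur = A, C        # map for 2**k steps, starts as one step
    while n > 0:
        if n & 1:
            # Compose the current power-of-two map onto the accumulator.
            a_acc, c_acc = (a_cur * a_acc) % M, (a_cur * c_acc + c_cur) % M
        # Square the current map: applying it twice doubles its step count.
        a_cur, c_cur = (a_cur * a_cur) % M, (a_cur * c_cur + c_cur) % M
        n >>= 1
    return (a_acc * x + c_acc) % M


def stream(seed, worker, block):
    """Numbers that worker `worker` would generate for its own block."""
    x = lcg_skip(seed, worker * block)   # jump to the start of this block
    out = []
    for _ in range(block):
        x = lcg_next(x)
        out.append(x)
    return out
```

Concatenating `stream(seed, w, block)` for `w = 0, 1, 2, ...` gives the same numbers as stepping a single generator sequentially, which makes the parallel output easy to check against a serial reference.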

All-optical serial-to-parallel and parallel-to-serial data format converters using SLALOM (SLALOM을 이용한 전광 직렬-병렬 데이터 형식 변환기)

Lee, Sung-Chul;Lee, Ki-Chul;Lee, Seok;Park, Jin-Woo
- Korean Journal of Optics and Photonics / v.13 no.5 / pp.425-429 / 2002
In this paper, we propose new, simple schemes for all-optical serial-to-parallel and parallel-to-serial data format converters based on a semiconductor laser amplifier in a loop mirror (SLALOM) for all-optical data processing. They have the advantages of a simple and easily expandable structure, efficient operation, and easy implementation. We implement the proposed all-optical data converters and experimentally demonstrate their operation.
https://doi.org/10.3807/KJOP.2002.13.5.425

Tile-based Parallelizing for a Fast HEVC Encoder (HEVC 부호화기 고속화를 위한 타일 기반 병렬화)

Kim, Younhee;Jun, DongSan;Jung, Soon-Heung;Seok, Jinwuk;Choi, Jin Soo
- Proceedings of the Korean Society of Broadcast Engineers Conference / 2012.07a / pp.290-293 / 2012
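HEVC is the next-generation standard being developed with the goal of a 50% improvement in compression performance over the existing AVC. To speed up the HEVC encoder, this paper proposes a structure that efficiently parallelizes the encoder based on tiles, the picture-partitioning tool of HEVC. The rate-distortion-based mode decision process, which has high complexity in the encoder, is implemented with multi-core parallel programming, and the speed improvement obtained from parallel processing is presented. Tiles are a structure adopted by HEVC to support parallel processing: a picture can be split into several parts that are encoded and decoded separately, which makes tiles well suited as a unit of parallel processing, and the change in compression performance caused by picture partitioning has been reported several times in standardization contributions. According to the results of this paper, when as many threads as tiles are created and rate-distortion-based mode decision is performed in parallel per tile, a 6-fold speedup over the existing reference software is obtained with 12 threads. Extending the set of modules that can be processed in parallel is expected to increase the speedup gained from additional threads, bringing the development of a real-time encoder for practical use a step closer.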
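
The one-thread-per-tile structure described in the abstract above can be sketched as follows. This is an illustrative outline, not the authors' encoder: `decide_modes` is a hypothetical stand-in for rate-distortion mode decision (it only sums the samples in a tile), and the tile grid here ignores the CTU alignment that real HEVC tiles require.

```python
from concurrent.futures import ThreadPoolExecutor


def tile_rects(width, height, cols, rows):
    """Split a frame into cols x rows tiles, returned as (x, y, w, h)."""
    xs = [width * c // cols for c in range(cols + 1)]
    ys = [height * r // rows for r in range(rows + 1)]
    return [(xs[c], ys[r], xs[c + 1] - xs[c], ys[r + 1] - ys[r])
            for r in range(rows) for c in range(cols)]


def decide_modes(frame, rect):
    """Hypothetical per-tile work; a real encoder would run RD mode decision."""
    x, y, w, h = rect
    return sum(sum(row[x:x + w]) for row in frame[y:y + h])


def encode_frame(frame, cols, rows):
    """Run the per-tile work with one thread per tile, in tile order."""
    rects = tile_rects(len(frame[0]), len(frame), cols, rows)
    with ThreadPoolExecutor(max_workers=len(rects)) as pool:
        # Tiles share no state here, so they can be processed independently.
        return list(pool.map(lambda r: decide_modes(frame, r), rects))
```

In CPython, threads do not speed up CPU-bound pure-Python work because of the global interpreter lock, so this sketch shows only the decomposition; a native encoder would map each tile to an OS thread.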

H.264/AVC Fast Intra Mode Decision using GPGPU Parallel Programming (GPGPU 병렬 프로그래밍을 이용한 H.264/AVC 고속 화면내 예측 모드 결정)

Choi, Sung-Jun;Han, Ki-Hun;Yoo, Yeong-Soo
- Proceedings of the Korean Society of Broadcast Engineers Conference / 2011.11a / pp.110-112 / 2011
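Research on GPGPU computing, which applies the parallelism and computing power of GPUs to general engineering problems, has recently been active. Video compression applies many algorithms that repeat the same operation over large amounts of pixel data, so it is very well suited as an application of fast parallel computation on GPGPU. H.264/AVC is the most recent international standard for video compression and is widely used in the market across many product lines and services. This paper applies GPGPU parallel programming to the intra prediction mode decision process of H.264/AVC and proposes a method that speeds up prediction mode decision. CUDA C was used for data-parallel processing on the GPU, and the computation on the CPU was implemented in C. By deciding the intra prediction modes for the entire frame in parallel on the GPU, the time required for this step was reduced. Experiments confirmed a speedup of about 2.8 times for Full-HD video when the prediction modes were decided in parallel on the GPU. If GPGPU parallel programming is applied in future not only to intra prediction but also to other algorithms that perform repeated operations, reducing the computational burden on the encoder, the development of fast real-time video encoders is expected to become easier.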

Development of Parallel DSP System Using TMS320C6701 (TMS320C6701 을 이용한 병렬 DSP 시스템 개발)

이태호;정수운;이동호
- Proceedings of the IEEK Conference / 2001.09a / pp.821-824 / 2001
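This paper describes the design and implementation of a parallel DSP system based on the TMS320C6701 that can process large amounts of data in real time. The system supports communication between DSP chips and between boards. A DSP chip can act as master and access the local memory of another DSP chip through the EMIF (External Memory Interface) port, and an external host processor can download programs to the DSP chips on the board. Signals processed by the DSP chips are transferred to the host over the PCI bus, while communication from a DSP chip to another DSP chip or to local memory takes place directly over the local bus. The parallel DSP system makes high-speed parallel signal processing possible.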

A Parallel Processing of Finding Neighbor Agents in Flocking Behaviors Using GPU (GPU를 이용한 무리 짓기에서 이웃 에이전트 찾기의 병렬 처리)

Lee, Jae-Moon
- Journal of Korea Game Society / v.10 no.5 / pp.95-102 / 2010
This paper proposes a parallel algorithm for flocking behaviors using a GPU. We used CUDA as the GPU's parallel processing architecture and analyzed its characteristics and constraints. Based on this analysis, we improved performance by parallelizing the search for an agent's neighbors, which is the most costly step in flocking behaviors. We implemented the proposed algorithm on a GTX 285 GPU and experimentally compared its performance with the original spatial partitioning method. The results showed that the proposed algorithm outperformed the original method by up to 9 times in execution time.
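
For context, the spatial partitioning baseline mentioned in the abstract above is usually a uniform grid: agents are binned into cells, and a neighbor query inspects only the surrounding cells. The sketch below is a generic CPU version under the assumption that the cell size equals the query radius; the function names are illustrative and none of it is taken from the paper. Each agent's query reads shared data and writes only its own result, which is what makes this step a candidate for one-thread-per-agent execution on a GPU.

```python
import math
from collections import defaultdict


def build_grid(positions, cell):
    """Bin agent indices by the grid cell that contains each 2-D position."""
    grid = defaultdict(list)
    for i, (x, y) in enumerate(positions):
        grid[(math.floor(x / cell), math.floor(y / cell))].append(i)
    return grid


def neighbors(i, positions, grid, radius):
    """Indices of agents within `radius` of agent i (grid cell size == radius)."""
    x, y = positions[i]
    cx, cy = math.floor(x / radius), math.floor(y / radius)
    found = []
    # Only the 3x3 block of cells around the agent can contain neighbors.
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for j in grid.get((cx + dx, cy + dy), ()):
                px, py = positions[j]
                if j != i and (px - x) ** 2 + (py - y) ** 2 <= radius ** 2:
                    found.append(j)
    return sorted(found)
```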

Compression-Based Ray-Casting of Huge Volume Data on Distributed Memory Environments (분산 메모리 환경에서의 방대한 볼륨데이터의 압축기반 광선추적법)

송동섭;박상훈;임인성
- Proceedings of the Korean Information Science Society Conference / 2000.04b / pp.634-636 / 2000
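Existing parallel volume rendering methods relied on parallel computers with very fast interconnects because of the heavy communication between processors, and implementation on distributed environments with slow communication seemed infeasible. In addition, the volume data to be visualized keeps growing. This paper therefore proposes a parallel/distributed volume rendering method that is not constrained by communication speed and that handles very large volume data. Instead of costly remote memory accesses, the method is compression-based and quickly reconstructs the required data from local memory, yielding good speedup. This means that each processor holds the entire volume data. Consequently, communication between processors during rendering could be minimized, and the approach is well suited to distributed-memory parallel computers and PC/workstation clusters, where high communication cost makes efficient parallel/distributed processing difficult.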
