• Title/Summary/Keyword: GPU implementation


An Effective Parallel Implementation of Sound Synthesis of Guitar using GPU (GPU를 이용한 기타의 음 합성을 위한 효과적인 병렬 구현)

  • Kang, Sung-Mo; Kim, Jong-Myon
    • Journal of the Korea Society of Computer and Information / v.18 no.8 / pp.1-8 / 2013
  • This paper proposes an effective parallel implementation of physical-modeling guitar synthesis on a GPU. We used appropriate filter coefficients and adjusted the delay-line length of each open string to generate six-polyphonic guitar sounds (E2, A2, D3, G4, B3, E4) at a 44,100 Hz sampling rate using physical modeling synthesis. In addition, we analyzed the physical modeling synthesis algorithm and observed that parallelism inherent in the delay-line length can be exploited. We therefore assigned as many CUDA cores as the delay-line length and implemented the physical modeling synthesis on the GPU to achieve the highest performance. Experimental results indicated that the spectra of the synthesized guitar sounds closely matched those of the original sounds. In addition, the GPU achieved 68x and 3x better performance than a high-performance TI DSP and a CPU, respectively. Furthermore, this paper implemented and evaluated the performance of multi-GPU systems for the physical modeling algorithm.
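
The delay-line parallelism can be made concrete with a short CUDA sketch. In a Karplus-Strong style string model, each new sample y[n] = 0.5(y[n-L] + y[n-L-1]) reads only samples at least L positions in the past, so L consecutive samples can be computed by L threads in one launch. The sketch below assumes a single string, an arbitrary delay length L = 200, and noise excitation; the paper's six strings could map to a second grid dimension.

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

// One Karplus-Strong update block: each of the L threads computes one new
// sample. All reads land at least L samples back, i.e. before this block,
// so the threads are independent within a launch.
__global__ void ks_block(float* y, int n0, int L) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= L) return;
    int n = n0 + i;                               // global sample index
    y[n] = 0.5f * (y[n - L] + y[n - L - 1]);      // averaging low-pass over the delay line
}

int main() {
    const int L = 200;                 // delay-line length sets the pitch (assumed value)
    const int total = 44100;           // one second at 44.1 kHz
    float* d_y;
    cudaMalloc((void**)&d_y, total * sizeof(float));

    // Seed the first L+1 samples with noise (the plucked-string excitation).
    float* h_init = (float*)malloc((L + 1) * sizeof(float));
    for (int i = 0; i <= L; ++i)
        h_init[i] = 2.0f * rand() / RAND_MAX - 1.0f;
    cudaMemcpy(d_y, h_init, (L + 1) * sizeof(float), cudaMemcpyHostToDevice);

    // Sequential launches, L-way parallel inside each launch; launches on the
    // default stream are ordered, which preserves the feedback dependency.
    for (int n0 = L + 1; n0 + L <= total; n0 += L)
        ks_block<<<(L + 255) / 256, 256>>>(d_y, n0, L);
    cudaDeviceSynchronize();

    cudaFree(d_y);
    free(h_init);
    return 0;
}
```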

Efficient Parallel Block-layered Nonbinary Quasi-cyclic Low-density Parity-check Decoding on a GPU

  • Thi, Huyen Pham; Lee, Hanho
    • IEIE Transactions on Smart Processing and Computing / v.6 no.3 / pp.210-219 / 2017
  • This paper proposes a modified min-max algorithm (MMMA) for nonbinary quasi-cyclic low-density parity-check (NB-QC-LDPC) codes and a corresponding efficient parallel block-layered decoder architecture on a graphics processing unit (GPU) platform. The algorithm removes multiplications over the Galois field (GF) in the merger step to reduce decoding latency without any performance loss. The GPU-based decoding implementation for NB-QC-LDPC codes improves both flexibility and scalability. To perform the decoding on the GPU, data and memory structures suitable for parallel computing are designed. Implementation results for NB-QC-LDPC codes over GF(32) and GF(64) demonstrate that parallel block-layered decoding on a GPU accelerates the decoding process, providing a faster decoding runtime and a higher coding gain at bit error rates down to $10^{-10}$ and frame error rates down to $10^{-7}$, compared to existing methods.
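
As a rough illustration of the arithmetic such decoders parallelize, the kernel below computes one elementary min-max check-node step over GF(q): for each field element a, the output reliability is the minimum over all decompositions a = b + c of the maximum of the two input reliabilities (0 being most reliable in min-max decoding). It assumes q is a power of two with field addition realized as XOR of polynomial-basis indices, and it omits the edge-label permutations whose GF multiplications the proposed MMMA removes; it is not the paper's layered decoder.

```cuda
#include <cuda_runtime.h>
#include <cfloat>

// Elementary min-max check-node step over GF(q), q a power of two.
// u, v: input reliability vectors (length q, 0 = most reliable).
// w:    output vector, w[a] = min over b of max(u[b], v[a ^ b]),
//       where XOR realizes GF addition on polynomial-basis indices.
__global__ void minmax_cn_step(const float* u, const float* v, float* w, int q) {
    int a = blockIdx.x * blockDim.x + threadIdx.x;
    if (a >= q) return;
    float best = FLT_MAX;
    for (int b = 0; b < q; ++b)
        best = fminf(best, fmaxf(u[b], v[a ^ b]));
    w[a] = best;
}
```

A layered decoder would chain many such steps per layer, one thread per GF element and check node; for GF(32) and GF(64), q is small enough that the O(q^2) inner loop stays cheap.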

Time Complexity Measurement on CUDA-based GPU Parallel Architecture of Morphology Operation

  • Izmantoko, Yonny S.; Choi, Heung-Kook
    • Journal of Korea Multimedia Society / v.16 no.4 / pp.444-452 / 2013
  • The running time of a function or procedure always needs to be optimized, and parallelization is the general method for reducing it. One of the most powerful parallelization platforms is the GPU. In image processing, one of the most commonly used operations is morphology. Three types of morphology kernels (naïve, global, and shared) are presented in this paper. All kernels are written in CUDA and run in parallel on the GPU. Four morphology operations (erosion, dilation, opening, and closing) with a square structuring element are tested on MRI images of different sizes to measure the speedup of the GPU implementation over the CPU implementation. The results show that the speedup of dilation is similar for all kernels; however, for erosion, opening, and closing, the shared kernel is faster than the others.
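
The difference between the global and shared kernels comes down to staging an image tile plus its halo in on-chip memory before scanning neighborhoods. The kernel below is a minimal shared-memory sketch of grayscale erosion (a minimum filter) with a 3x3 square structuring element on an 8-bit image; the tile size and border clamping are illustrative choices, not the paper's exact kernels. Dilation is the same kernel with max in place of min, and opening and closing chain the two.

```cuda
#include <cuda_runtime.h>

#define TILE 14   // interior tile; 16x16 threads load the tile plus a 1-pixel halo

// Grayscale erosion with a 3x3 square structuring element (minimum filter).
// Each 16x16 thread block loads one clamped pixel into shared memory; the
// inner 14x14 threads then take the minimum over their 3x3 neighborhood.
__global__ void erode3x3_shared(const unsigned char* in, unsigned char* out,
                                int w, int h) {
    __shared__ unsigned char s[TILE + 2][TILE + 2];
    int gx = blockIdx.x * TILE + threadIdx.x - 1;   // global coords incl. halo
    int gy = blockIdx.y * TILE + threadIdx.y - 1;
    int cx = min(max(gx, 0), w - 1);                // clamp at image borders
    int cy = min(max(gy, 0), h - 1);
    s[threadIdx.y][threadIdx.x] = in[cy * w + cx];
    __syncthreads();

    if (threadIdx.x >= 1 && threadIdx.x <= TILE &&
        threadIdx.y >= 1 && threadIdx.y <= TILE && gx < w && gy < h) {
        int m = 255;
        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx)
                m = min(m, (int)s[threadIdx.y + dy][threadIdx.x + dx]);
        out[gy * w + gx] = (unsigned char)m;
    }
}
// Launch: dim3 block(16, 16); dim3 grid((w + TILE - 1) / TILE, (h + TILE - 1) / TILE);
```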

Implementation of Lattice Reduction-aided Detector using GPU on SDR System (SDR 시스템에서 GPU를 사용한 Lattice Reduction-aided 검출기 구현)

  • Kim, Tae Hyun; Leem, Hyun Seok; Choi, Seung Won
    • Journal of Korea Society of Digital Industry and Information Management / v.7 no.3 / pp.55-61 / 2011
  • This paper presents an implementation of a Lattice Reduction (LR)-aided detector for a Multiple-Input Multiple-Output (MIMO) system using a Graphics Processing Unit (GPU). A GPU is a parallel processor with many Arithmetic Logic Units (ALUs), so it can minimize the running time of the LR algorithm through parallelization across multiple threads. Using the implemented LR-aided detector, we verify that it operates much faster than a Maximum Likelihood (ML) detector. The implemented LR-aided detector has been applied to a WiMAX system to show the feasibility of real-time processing. In addition, we demonstrate that the processing time can be reduced at the cost of a 3 dB SNR loss by limiting the iteration loop of the Lenstra-Lenstra-Lovász (LLL) algorithm that is frequently used in LR-aided detection.
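
The loop-limiting idea can be sketched in host code: a textbook LLL reduction whose main loop is capped at a fixed iteration budget, so the runtime becomes bounded at the price of a possibly less-reduced basis (consistent with the reported 3 dB SNR loss). The sketch below uses delta = 0.75 and a naive recomputed Gram-Schmidt; it is plain C++ (buildable with nvcc) and illustrative only, not the paper's GPU-parallelized version.

```cuda
#include <cmath>

#define N 4                         // lattice dimension (e.g., a 4x4 MIMO channel)

static double dot(const double* a, const double* b) {
    double s = 0;
    for (int i = 0; i < N; ++i) s += a[i] * b[i];
    return s;
}

// Recompute the Gram-Schmidt orthogonalization of basis b (rows are vectors):
// bstar holds the orthogonal vectors, mu the projection coefficients,
// Bsq the squared norms |b*_i|^2.
static void gram_schmidt(double b[N][N], double bstar[N][N],
                         double mu[N][N], double Bsq[N]) {
    for (int i = 0; i < N; ++i) {
        for (int k = 0; k < N; ++k) bstar[i][k] = b[i][k];
        for (int j = 0; j < i; ++j) {
            mu[i][j] = dot(b[i], bstar[j]) / Bsq[j];
            for (int k = 0; k < N; ++k) bstar[i][k] -= mu[i][j] * bstar[j][k];
        }
        Bsq[i] = dot(bstar[i], bstar[i]);
    }
}

// LLL with delta = 0.75; maxIters caps the main loop, trading reduction
// quality (detection SNR) for a bounded, real-time-friendly runtime.
void lll_capped(double b[N][N], int maxIters) {
    double bstar[N][N], mu[N][N], Bsq[N];
    gram_schmidt(b, bstar, mu, Bsq);
    int k = 1, iters = 0;
    while (k < N && iters++ < maxIters) {
        for (int j = k - 1; j >= 0; --j) {            // size-reduce b_k
            long r = lround(mu[k][j]);
            if (r != 0) {
                for (int i = 0; i < N; ++i) b[k][i] -= r * b[j][i];
                gram_schmidt(b, bstar, mu, Bsq);
            }
        }
        if (Bsq[k] >= (0.75 - mu[k][k - 1] * mu[k][k - 1]) * Bsq[k - 1]) {
            ++k;                                      // Lovász condition holds
        } else {                                      // swap b_k and b_{k-1}
            for (int i = 0; i < N; ++i) {
                double t = b[k][i]; b[k][i] = b[k - 1][i]; b[k - 1][i] = t;
            }
            gram_schmidt(b, bstar, mu, Bsq);
            k = (k > 1) ? k - 1 : 1;
        }
    }
}
```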

Implementation of Viterbi Decoder on Massively Parallel GPU for DVB-T Receiver (DVB-T 수신기를 위한 대규모 병렬처리 GPU 기반의 비터비 복호기 구현)

  • Lee, KyuHyung; Lee, Ho-Kyoung; Heo, Seo Weon
    • Journal of the Institute of Electronics and Information Engineers / v.50 no.9 / pp.3-11 / 2013
  • Recently, many studies have applied the massively parallel processing of GPUs to the implementation of communication systems. In this paper, we reduce software simulation time by applying a GPU with the sliding-block method to the Viterbi decoder of the DVB-T system, one of the European DTV standards. First, we implement the DVB-T system on a CPU and measure the time it takes to process one OFDM symbol. Second, we implement the Viterbi decoder in software on NVIDIA's massively parallel GPU. In our work, the stream processing method is applied to reduce the overhead of data transfer between the CPU and the GPU, and memory coalescing is used to lower global memory access time. In addition, the data structures are designed to maximize shared memory usage. Consequently, the proposed method is approximately 11 times faster in 2K mode and 60 times faster in 8K mode for the Viterbi decoding process.
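
The stream processing method mentioned above follows a standard CUDA pattern: pinned host buffers plus per-stream asynchronous copy, kernel launch, and copy-back, so transfers for one chunk overlap computation on another. The skeleton below shows that pattern with a placeholder per-symbol kernel and assumed chunk sizes; the paper's actual sliding-block Viterbi kernels are not reproduced here.

```cuda
#include <cuda_runtime.h>
#include <cmath>

#define NSTREAMS 4
#define CHUNK    (1 << 16)          // symbols per chunk (assumed size)

// Placeholder for per-symbol work (e.g., branch-metric computation); the
// real decoder would run its sliding-block Viterbi stages here.
__global__ void per_symbol(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = fabsf(in[i]);
}

int main() {
    float *h_in, *h_out, *d_in, *d_out;
    // Pinned host buffers are required for truly asynchronous copies.
    cudaHostAlloc((void**)&h_in,  NSTREAMS * CHUNK * sizeof(float), cudaHostAllocDefault);
    cudaHostAlloc((void**)&h_out, NSTREAMS * CHUNK * sizeof(float), cudaHostAllocDefault);
    cudaMalloc((void**)&d_in,  NSTREAMS * CHUNK * sizeof(float));
    cudaMalloc((void**)&d_out, NSTREAMS * CHUNK * sizeof(float));
    for (int j = 0; j < NSTREAMS * CHUNK; ++j) h_in[j] = 0.0f;  // dummy input

    cudaStream_t s[NSTREAMS];
    for (int i = 0; i < NSTREAMS; ++i) cudaStreamCreate(&s[i]);

    const int numChunks = 64;                      // assumed workload size
    for (int c = 0; c < numChunks; ++c) {
        int i = c % NSTREAMS;                      // round-robin over streams;
        size_t off = (size_t)i * CHUNK;            // in-stream ordering makes buffer reuse safe
        cudaMemcpyAsync(d_in + off, h_in + off, CHUNK * sizeof(float),
                        cudaMemcpyHostToDevice, s[i]);
        per_symbol<<<(CHUNK + 255) / 256, 256, 0, s[i]>>>(d_in + off, d_out + off, CHUNK);
        cudaMemcpyAsync(h_out + off, d_out + off, CHUNK * sizeof(float),
                        cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < NSTREAMS; ++i) cudaStreamDestroy(s[i]);
    cudaFreeHost(h_in); cudaFreeHost(h_out);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```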

Parallelization of Feature Detection and Panorama Image Generation using OpenCL and Embedded GPU (OpenCL 및 Embedded GPU를 이용한 영상 특징 추출 및 파노라마 영상 생성의 병렬화)

  • Kang, Seung Heon; Lee, Seung-Jae; Lee, Man Hee; Park, In Kyu
    • Journal of Broadcast Engineering / v.19 no.3 / pp.316-328 / 2014
  • In this paper, we parallelize the popular feature detection algorithms SIFT and SURF and apply them to fast panoramic image generation on the latest embedded GPU. The parallelized algorithms are implemented using the recently introduced OpenCL as the embedded GPGPU software platform. We compare the implementation efficiency and speed of the conventional OpenGL Shading Language (GLSL) and OpenCL. Experimental results show that the OpenCL implementation achieves performance comparable to GLSL. Compared with the embedded CPU in the same application processor, the embedded GPU runs 3~4 times faster. As an application of the feature extraction, panoramic image synthesis is performed on the embedded GPU, using image matching on the detected features.

Implementation of a 3D Graphics Simulator for GP-GPU (GP-GPU 개발을 위한 3차원 그래픽 시뮬레이터 구현)

  • Yeo, Dong-young; Kim, Woo-young; Jung, Hyung-Ki; Lee, Kwang-Yeob
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2009.10a / pp.337-340 / 2009
  • The performance of the GPU (Graphics Processing Unit), a hardware accelerator for 3D graphics processing, has been improving constantly. The GPU was introduced as an efficient way to run complex graphics applications, yet its resources are rarely utilized at 100%. The GP-GPU (general-purpose GPU), which handles general-purpose operations on the GPU alongside graphics operations, is notable in that the distribution of its resources can be controlled effectively. In this paper, a simulator was implemented that provides a virtual GP-GPU environment and can be used for program design and debugging. This supports a co-design development environment for fast simultaneous design and reliable verification when building the interface of a 3D graphics display.


An MPI-CUDA Implementation for Parallel Scalability on Multi-GPU Clusters (멀티-GPU 기반 MPI-CUDA 병렬 성능 확장성)

  • Yi, Hong-Suk; Lee, Seung-Min
    • Proceedings of the Korean Information Science Society Conference / 2012.06a / pp.13-15 / 2012
  • With the very high performance and low development cost of recent GPUs, the GPU has become an essential resource for large-scale computational science. In this paper, we developed a large-scale Monte Carlo algorithm using GPU computing on a multi-GPU cluster system. By applying MPI and CUDA together, we obtained parallel scalability up to eight GPUs. Analysis of the parallel scalability showed that, on a multi-GPU cluster, data communication between GPUs is the critical factor determining the overall performance improvement of the program.
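
A minimal MPI-CUDA skeleton in the spirit of this setup, with one MPI rank per GPU, an independent Monte Carlo kernel per device, and a single MPI_Reduce to merge results, is sketched below. The pi-estimation workload is a stand-in; the paper does not specify its Monte Carlo application beyond this structure.

```cuda
#include <mpi.h>
#include <cuda_runtime.h>
#include <curand_kernel.h>
#include <cstdio>

// Each thread draws `iters` points in the unit square and counts hits
// inside the quarter circle; counts are accumulated atomically per device.
__global__ void mc_pi(unsigned long long seed, unsigned long long seqOffset,
                      int iters, unsigned long long* hits) {
    unsigned long long tid = blockIdx.x * blockDim.x + threadIdx.x;
    curandState st;
    curand_init(seed, seqOffset + tid, 0, &st);   // distinct subsequence per thread
    unsigned long long h = 0;
    for (int i = 0; i < iters; ++i) {
        float x = curand_uniform(&st), y = curand_uniform(&st);
        if (x * x + y * y <= 1.0f) ++h;
    }
    atomicAdd(hits, h);
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int ndev;
    cudaGetDeviceCount(&ndev);
    cudaSetDevice(rank % ndev);                   // one GPU per MPI rank

    const int threads = 256, blocks = 256, iters = 1000;
    unsigned long long *d_hits, h_hits = 0;
    cudaMalloc((void**)&d_hits, sizeof(unsigned long long));
    cudaMemset(d_hits, 0, sizeof(unsigned long long));
    mc_pi<<<blocks, threads>>>(1234ULL,
        (unsigned long long)rank * blocks * threads, iters, d_hits);
    cudaMemcpy(&h_hits, d_hits, sizeof h_hits, cudaMemcpyDeviceToHost);
    cudaFree(d_hits);

    // The only inter-GPU communication: one reduction of the hit counts.
    unsigned long long total = 0;
    MPI_Reduce(&h_hits, &total, 1, MPI_UNSIGNED_LONG_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) {
        double samples = (double)size * blocks * threads * iters;
        printf("pi ~ %f\n", 4.0 * total / samples);
    }
    MPI_Finalize();
    return 0;
}
```

Built with nvcc plus the MPI compiler wrapper (e.g., nvcc -ccbin mpicxx). Embarrassingly parallel workloads like this communicate only at the final reduction, which is why the paper finds inter-GPU data traffic to be the dominant scalability factor for less trivially parallel cases.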

Efficient Implementation of Cryptography on GPU and GPU Resistance of Cryptography (GPU 상에서의 최적화 암호 구현과 암호의 GPU 내성)

  • Seo, Hwa-Jeong; Kwon, Hyeok-Dong; Kim, Hyun-Jun; Eum, Si-Woo; Sim, Min-Joo
    • Proceedings of the Korea Information Processing Society Conference / 2021.11a / pp.263-266 / 2021
  • For efficient implementation of cryptography on a GPU, it is important to use the GPU's internal resources, its memory and instruction set, in a way that matches the structure of the target cipher. In this paper, we present optimized implementations of Blowfish and RC4 on the latest GPU processors and analyze the factors that degrade and improve performance. In particular, we examine the implementation considerations that arise in file encryption and password cracking, and how these characteristics affect cipher implementations on the GPU. Finally, based on the analysis of the performance-degrading factors, we examine the structures required to design ciphers with high GPU resistance.
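
The password-cracking-style parallelism analyzed here, with many independent RC4 instances running one per thread, can be sketched as follows. Each thread runs the standard KSA and PRGA on its own key, and the 256-byte state array per thread is exactly the kind of memory pressure such analyses weigh. This is an illustrative sketch, not the paper's optimized implementation.

```cuda
#include <cuda_runtime.h>

// One independent RC4 keystream per thread (password-cracking style):
// keys holds one key of keylen bytes per thread, back to back; each thread
// writes outlen keystream bytes to its own slice of out.
__global__ void rc4_keystream(const unsigned char* keys, int keylen,
                              unsigned char* out, int outlen) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned char* key = keys + (size_t)tid * keylen;
    unsigned char S[256];                     // per-thread state (local memory)
    for (int i = 0; i < 256; ++i) S[i] = (unsigned char)i;

    int j = 0;                                // key-scheduling algorithm (KSA)
    for (int i = 0; i < 256; ++i) {
        j = (j + S[i] + key[i % keylen]) & 255;
        unsigned char t = S[i]; S[i] = S[j]; S[j] = t;
    }

    unsigned char* dst = out + (size_t)tid * outlen;
    int i = 0; j = 0;                         // pseudo-random generation (PRGA)
    for (int n = 0; n < outlen; ++n) {
        i = (i + 1) & 255;
        j = (j + S[i]) & 255;
        unsigned char t = S[i]; S[i] = S[j]; S[j] = t;
        dst[n] = S[(S[i] + S[j]) & 255];
    }
}
```

The byte-serial state updates and the per-thread 256-byte table are the kind of structure that resists GPU acceleration of a single stream, while the independence across keys is what makes bulk cracking fast; both sides of the paper's analysis show up in this small kernel.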

Implementation of GPU based MPEG-2 Decoder (GPU 기반의 MPEG-2 디코더의 구현)

  • Kim, Kyung-Su; Kim, Hong-Sik; Kim, Cheong-Ghil; Park, Woo-Chan
    • Journal of Digital Contents Society / v.9 no.3 / pp.371-377 / 2008
  • Recently, the performance of GPUs has been increasing much faster than that of CPUs, and GPUs are used for various application programs. In this paper, an MPEG-2 decoder is implemented in the GPU programming language Cg. The proposed methodology performs block rendering with texture data according to the video standard, with very high parallelism, by using the GPU's pipeline, which is a stream-processing structure. To reduce the data bandwidth between system memory and the GPU, the graphics card's local memory is used. Experiments show that the proposed scheme achieves a performance improvement of more than 2x over a CPU-based scheme.
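
The paper's decoder works through Cg shaders and the GPU's rendering pipeline; purely as a modern point of comparison, the reconstruction step being rendered, adding the decoded IDCT residual to the motion-compensated prediction and saturating to 8 bits, would look like the small CUDA kernel below. This is a hedged analogue, not the paper's method.

```cuda
#include <cuda_runtime.h>

// MPEG-2 block reconstruction: clamp(prediction + residual) to [0, 255].
// pred: motion-compensated prediction samples; resid: decoded IDCT residuals.
__global__ void reconstruct(const unsigned char* pred, const short* resid,
                            unsigned char* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int v = (int)pred[i] + (int)resid[i];
        out[i] = (unsigned char)min(max(v, 0), 255);
    }
}
```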
