• Title/Summary/Keyword: CUDA Implementation

Accelerating the Sweep3D for a Graphic Processor Unit

  • Gong, Chunye;Liu, Jie;Chen, Haitao;Xie, Jing;Gong, Zhenghu
    • Journal of Information Processing Systems / v.7 no.1 / pp.63-74 / 2011
  • As a powerful and flexible processor, the Graphics Processing Unit (GPU) offers considerable capability for many high-performance computing applications. Sweep3D, which deterministically simulates single-group, time-independent discrete ordinates (Sn) neutron transport on a 3D Cartesian geometry, represents the key part of a real ASCI application. The wavefront process used for parallel computation in Sweep3D limits the number of concurrent threads on the GPU. In this paper, we present multi-dimensional optimization methods for Sweep3D that can be efficiently implemented on the fine-grained parallel architecture of the GPU. Our results show that the overall performance of Sweep3D on a CPU-GPU hybrid platform improves by up to 4.38 times compared with the CPU-based implementation.
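
The abstract's remark that the wavefront process limits concurrency can be illustrated with a minimal sketch. It assumes the generic diagonal-wavefront dependency for a sweep that starts at one corner of the grid, where cell (i, j, k) needs its three upstream neighbours; the function name `wavefronts` is hypothetical, and this is not the paper's optimization scheme.

```python
from collections import defaultdict


def wavefronts(nx, ny, nz):
    """Group the cells of an nx*ny*nz grid by diagonal index i + j + k.

    Assumption: a sweep starting at corner (0, 0, 0), where cell (i, j, k)
    depends on (i-1, j, k), (i, j-1, k) and (i, j, k-1). All cells that share
    one diagonal index are then independent of each other.
    """
    planes = defaultdict(list)
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):
                planes[i + j + k].append((i, j, k))
    # Diagonal indices run from 0 to (nx - 1) + (ny - 1) + (nz - 1).
    return [planes[d] for d in range(nx + ny + nz - 2)]


# Cells available for parallel work at each step of a 4x4x4 sweep.
sizes = [len(front) for front in wavefronts(4, 4, 4)]
print(sizes)
```

The first and last wavefronts contain a single cell each, so only the middle steps of the sweep can keep many threads busy, which is the kind of limit the abstract describes.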

PIPO block cipher optimal implementation technology trend (PIPO 경량 블록암호 최적 구현 기술 동향)

  • Min-Woo Lee;Dong-Hyun Kim;Se-Young Yoon;Hwa-Jeong Seo
    • Proceedings of the Korea Information Processing Society Conference / 2023.05a / pp.107-109 / 2023
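  • This paper surveys research trends in optimized implementation techniques for the PIPO algorithm. PIPO is a lightweight block cipher with an SPN structure that uses an S-box secure against linear and differential attacks. The block size is 64 bits, and the cipher is divided into PIPO-128 and PIPO-256 according to the secret key size. A detailed description is given of the three internal operations of the PIPO algorithm (S-Layer, R-Layer, and Addroundkey) and of the operations used in each round. The paper covers methods for optimized implementation of PIPO on RISC-V and ARM processors and on CUDA GPGPUs. The surveyed studies show that, with optimized implementation techniques, secure and fast encryption and decryption can be performed even on IoT devices that use the PIPO cipher, and comparisons with prior work confirm the performance improvements.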

An Effective Parallel Implementation of Sound Synthesis of Guitar using GPU (GPU를 이용한 기타의 음 합성을 위한 효과적인 병렬 구현)

  • Kang, Sung-Mo;Kim, Jong-Myon
    • Journal of the Korea Society of Computer and Information / v.18 no.8 / pp.1-8 / 2013
  • This paper proposes an effective parallel implementation of physical modeling synthesis of a guitar on a GPU. We used appropriate filter coefficients and adjusted the delay-line length for each open string to generate 44,100 six-polyphonic guitar sounds (E2, A2, D3, G3, B3, E4) by physical modeling synthesis. In addition, we analyzed the physical modeling synthesis algorithm and observed that the parallelism inherent in the length of the delay line can be exploited. We therefore assigned as many CUDA cores as the length of the delay line and implemented the physical modeling synthesis on the GPU to achieve the highest performance. Experimental results indicated that the guitar sounds synthesized on the GPU were very similar to the original sounds when their spectra were compared. In addition, the GPU achieved 68x and 3x better performance than a high-performance TI DSP and a CPU, respectively. Furthermore, this paper implemented the physical modeling algorithm on multi-GPU systems and evaluated their performance.
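
To make the role of the delay line concrete, here is a minimal sequential sketch of plucked-string synthesis. It uses the well-known Karplus-Strong algorithm as a stand-in, since the abstract does not give the paper's exact model or filter coefficients. The 44,100 Hz sampling rate, the `damping` value, and the names `delay_length` and `pluck` are all assumptions made for illustration.

```python
import random

SAMPLE_RATE = 44100  # assumed sampling rate in Hz, not taken from the paper


def delay_length(freq_hz):
    """Delay-line length (in samples) that tunes the string to freq_hz."""
    return round(SAMPLE_RATE / freq_hz)


def pluck(freq_hz, n_samples, damping=0.996, seed=0):
    """Karplus-Strong sketch: a noise burst circulating in a delay line.

    Each output sample is read from the delay line, and the slot is then
    overwritten with a damped two-point average (a simple low-pass filter).
    The damping value is a hypothetical choice.
    """
    rng = random.Random(seed)
    n = delay_length(freq_hz)
    line = [rng.uniform(-1.0, 1.0) for _ in range(n)]  # initial excitation
    out = []
    for t in range(n_samples):
        idx = t % n
        nxt = (idx + 1) % n
        out.append(line[idx])
        line[idx] = damping * 0.5 * (line[idx] + line[nxt])
    return out


# Open low E string (E2 is about 82.41 Hz): one second of samples.
samples = pluck(82.41, SAMPLE_RATE)
```

A lower-pitched string needs a longer delay line, which is consistent with the abstract's point that the amount of available parallel work follows the delay-line length.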

Efficient Implementation of Candidate Region Extractor for Pedestrian Detection System with Stereo Camera based on GP-GPU (스테레오 영상 보행자 인식 시스템의 후보 영역 검출을 위한 GP-GPU 기반의 효율적 구현)

  • Jeong, Geun-Yong;Jeong, Jun-Hee;Lee, Hee-Chul;Jeon, Gwang-Gil;Cho, Joong-Hwee
    • IEMEK Journal of Embedded Systems and Applications / v.8 no.2 / pp.121-128 / 2013
  • There have been various research efforts on pedestrian recognition in embedded imaging systems. However, many suffer from heavy computational complexity. The SVM classification method has been widely used for pedestrian recognition, and reducing the number of candidate regions is crucial for a low-complexity scheme. In this paper, we propose a real-time HOG-based pedestrian detection system on a GPU, in which images are captured by a pair of cameras. To speed up the detection of pedestrians on the road, the proposed method reduces the number of detection windows with disparity-search and near-search algorithms and uses the GPU with the NVIDIA CUDA framework. This method achieves speedups of 20% or more compared with recent GPU implementations. The effectiveness of our algorithm is demonstrated in terms of processing time and detection performance.

Implementation of Parallel Computer Generated Hologram Using Multi-GPGPU (다중 GPGPU를 이용한 컴퓨터 생성 홀로그램의 병렬화 구현)

  • Seo, Young-Ho;Lee, Yoon-Hyuk;Kim, Dong-Wook
    • Journal of the Korea Institute of Information and Communication Engineering / v.18 no.5 / pp.1177-1186 / 2014
  • A computer-generated hologram (CGH) mathematically models an optical phenomenon on a digital computer. Because it requires a huge amount of computation, a fast, high-performance technique is needed. In this paper, we propose two parallelizations of the CGH calculation. The first parallelizes the CGH algorithm within a single GPU (graphics processing unit), and the second parallelizes it across multiple GPUs. The proposed algorithm was implemented on a GTX780 Ti GPU. It calculates a $1{,}024 \times 1{,}024$ hologram with 10K object points in about 24 ms.

Evaluation of GPU Computing Capacity for All-in-view GNSS SDR Implementation

  • Yun Sub, Choi;Hung Seok, Seo;Young Baek, Kim
    • Journal of Positioning, Navigation, and Timing / v.12 no.1 / pp.75-81 / 2023
  • In this study, we design an optimized Graphics Processing Unit (GPU)-based GNSS signal processing technique, with the goal of designing and implementing a GNSS Software Defined Receiver (SDR) that can operate in real-time all-in-view mode in a multi-constellation, multi-frequency signal environment. In the proposed structure, the correlators of the existing GNSS SDR are processed by the GPU. We designed a memory structure and processing method that minimize memory access bottlenecks and optimize the distribution of GPU memory resources. The designed GNSS SDR can select and operate on only the desired GNSS or satellite signals according to user input. Parameters such as the number of quantization bits, the sampling rate, and the number of signal tracking arms can also be selected. The computing capability of the designed GPU-based GNSS SDR was evaluated, and it was confirmed that up to 2400 channels can be processed in real time. The GPU-based GNSS SDR therefore has sufficient performance to operate in real-time all-in-view mode. In future studies, it will be used for more diverse GNSS signal processing and applied to multipath effect analysis using more tracking arms.

Implementation of GPU Based Polymorphic Worm Detection Method and Its Performance Analysis on Different GPU Platforms (GPU를 이용한 Polymorphic worm 탐지 기법 구현 및 GPU 플랫폼에 따른 성능비교)

  • Lee, Sunwon;Song, Chihwan;Lee, Injoon;Joh, Taewon;Kang, Jaewoo
    • Proceedings of the Korea Information Processing Society Conference / 2010.11a / pp.1458-1461 / 2010
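  • The scale of damage caused by malicious code grows every year, as illustrated by the DDoS attack of July 7 last year. Polymorphic worms are especially dangerous because existing methods have difficulty detecting them during the first attack. This study therefore introduces Local Alignment, a method used in bioinformatics to find similarities and characteristic features among genes, and applies it to polymorphic worm detection. It also overcomes the $O(n^4)$ running time of the existing algorithm through parallel execution and a modification of the algorithm. Parallelization was carried out with CUDA programming on NVIDIA GPUs and OpenCL programming on AMD GPUs, and the performance of the Local Alignment based polymorphic worm detection algorithm was compared across the two GPGPU platforms.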

Parallel Design and Implementation of Shot Boundary Detection Algorithm (샷 경계 탐지 알고리즘의 병렬 설계와 구현)

  • Lee, Joon-Goo;Kim, SeungHyun;You, Byoung-Moon;Hwang, DooSung
    • Journal of the Institute of Electronics and Information Engineers / v.51 no.2 / pp.76-84 / 2014
  • As the number of high-density videos increases, parallel processing approaches are necessary to process large-scale video data. When a video processing method requires thousands of simple operations, GPU-based parallel processing is preferred to CPU-based parallel processing because it reduces the time and space complexity of the computation. This paper studies the parallel design and implementation of a shot-boundary detection algorithm. The proposed algorithm uses pixel brightness comparisons and global histogram data among the blocks of frames, and the computation of these data is highly parallel. To maximize the number of operations performed in parallel, the pixel brightness and histogram computations are designed in parallel and implemented on an NVIDIA GPU. The GPU-based shot detection method is tested with 10 videos from the National Archive of Korea. In the experiments, the detection rate is similar to that of the CPU-based algorithm, but the computation is about 10 times faster.
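
The histogram part of the abstract can be sketched sequentially as follows. This is an illustration only: it covers a global brightness histogram per frame and omits the block-wise pixel comparisons, and the bin count, the threshold, and the function names are assumptions rather than values from the paper.

```python
def histogram(frame, bins=16, max_val=256):
    """Brightness histogram of one frame given as a flat list of 0..255 pixels."""
    counts = [0] * bins
    for pixel in frame:
        counts[pixel * bins // max_val] += 1
    return counts


def hist_distance(a, b):
    """Normalized L1 distance between two histograms of equal-sized frames.

    Returns 0.0 for identical histograms and 1.0 for disjoint ones.
    """
    return sum(abs(x - y) for x, y in zip(a, b)) / (2 * sum(a))


def detect_cuts(frames, threshold=0.5):
    """Indices i where frame i differs strongly from frame i - 1.

    The threshold of 0.5 is a hypothetical choice for this sketch.
    """
    hists = [histogram(f) for f in frames]
    return [i for i in range(1, len(hists))
            if hist_distance(hists[i - 1], hists[i]) > threshold]


# Two dark frames followed by two bright frames: the cut is at index 2.
frames = [[10] * 64, [12] * 64, [200] * 64, [205] * 64]
print(detect_cuts(frames))
```

Each pixel contributes to the histogram independently and each frame pair is compared independently, which is why this kind of computation maps well onto many GPU threads.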

Implementation of Viterbi Decoder on Massively Parallel GPU for DVB-T Receiver (DVB-T 수신기를 위한 대규모 병렬처리 GPU 기반의 비터비 복호기 구현)

  • Lee, KyuHyung;Lee, Ho-Kyoung;Heo, Seo Weon
    • Journal of the Institute of Electronics and Information Engineers / v.50 no.9 / pp.3-11 / 2013
  • Recently, much research has been conducted on using the massively parallel processing of GPUs to implement communication systems. In this paper, we reduce software simulation time by applying a GPU with the sliding block method to the Viterbi decoder in a DVB-T system, one of the European DTV standards. First, we implement the DVB-T system on a CPU and estimate the time the system takes to process one OFDM symbol. Second, we implement the Viterbi decoder in software on NVIDIA's massively parallel GPU. In our work, a stream processing method is applied to reduce the overhead of data transfer between the CPU and GPU, and a coalescing method is used to lower global memory access time. In addition, a data structure design method is used to maximize shared memory usage. Consequently, our proposed method is approximately 11 times faster in 2K mode and 60 times faster in 8K mode for Viterbi decoding.

Parallel Approximate String Matching with k-Mismatches for Multiple Fixed-Length Patterns in DNA Sequences on Graphics Processing Units (GPU을 이용한 다중 고정 길이 패턴을 갖는 DNA 시퀀스에 대한 k-Mismatches에 의한 근사적 병열 스트링 매칭)

  • Ho, ThienLuan;Kim, HyunJin;Oh, SeungRohk
    • The Transactions of The Korean Institute of Electrical Engineers / v.66 no.6 / pp.955-961 / 2017
  • In this paper, we propose a parallel approximate string matching algorithm with k-mismatches for multiple fixed-length patterns (PMASM) in DNA sequences. PMASM is developed from parallel single-pattern approximate string matching algorithms to efficiently calculate the Hamming distances for multiple patterns of a fixed length. In the preprocessing phase of PMASM, all target patterns are binary encoded and stored in a look-up memory. With each character read from the input string, the Hamming distances between a substring and all patterns are updated at the same time, based on the binary encoding information in the look-up memory. Moreover, PMASM uses graphics processing units (GPUs) to process the data computations in parallel. This paper presents three PMASM implementation methods on GPUs: thread PMASM, block-thread PMASM, and shared-mem PMASM. The shared-mem PMASM method gives an example of how to make effective use of the GPU's parallel capacity, and it also exploits special features of the CUDA (Compute Unified Device Architecture) memory structure to optimize performance. In experiments with DNA sequences, the proposed PMASM on a GPU is 385, 77, and 64 times faster than the traditional naive algorithm, the shift-add algorithm, and the single-thread PMASM implementation on a CPU, respectively. On the same NVIDIA GPU model, the proposed approach improves performance by up to 44% and 21% compared with the naive and shift-add algorithms, respectively.
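
For readers unfamiliar with the problem, the following minimal sketch shows k-mismatch matching for several fixed-length patterns using direct Hamming-distance comparison. It is a naive sequential baseline of the kind the abstract compares against, not PMASM itself: there is no binary encoding or look-up memory, and the function names are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def k_mismatch_search(text, patterns, k):
    """Return (position, pattern_index, distance) for every window of text
    that matches a pattern with at most k mismatches.

    All patterns must share one fixed length, as in the abstract.
    """
    m = len(patterns[0])
    if any(len(p) != m for p in patterns):
        raise ValueError("all patterns must have the same length")
    hits = []
    for pos in range(len(text) - m + 1):
        window = text[pos:pos + m]
        for pid, pattern in enumerate(patterns):
            dist = hamming(window, pattern)
            if dist <= k:
                hits.append((pos, pid, dist))
    return hits


# Two 4-character patterns searched with at most one mismatch.
print(k_mismatch_search("ACGTACGA", ["ACGT", "TCGA"], k=1))
```

Every text position can be checked independently of the others, which is the property a GPU implementation can exploit by giving positions to separate threads.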