• Title/Summary/Keyword: GPU algorithm

Search Result 266, Processing Time 0.03 seconds

IPC-based Dynamic SM management on GPGPU for Executing AES Algorithm

  • Son, Dong Oh;Choi, Hong Jun;Kim, Cheol Hong
    • Journal of the Korea Society of Computer and Information
    • /
    • v.25 no.2
    • /
    • pp.11-19
    • /
    • 2020
  • Modern GPU can execute general purpose computation on the graphic processing unit, and provide high performance by exploiting many core on GPU. To run AES algorithm efficiently, parallel computational resources are required. However, computational resource of CPU architecture are not enough to cryptographic algorithm such as AES whereas GPU architecture has mass parallel computation resources. Therefore, this paper reduce the time to execute AES by employing parallel computational resource on GPGPU. Unfortunately, AES cannot utilize computational resource on GPGPU since it isn't suitable to GPGPU architecture. In this paper, IPC based dynamic SM management technique are proposed to efficiently execute AES on GPGPU. IPC based dynamic SM management can increase and decrease the number of active SMs by using IPC in run-time. According to simulation results, proposed technique improve the performance by increasing resource utilization compared to baseline GPGPU architecture. The results show that AES improve the performance by 41.2% on average.

AN EFFICIENT INCOMPRESSIBLE FREE SURFACE FLOW SIMULATION USING GPU (GPU를 이용한 효율적인 비압축성 자유표면유동 해석)

  • Hong, H.E.;Ahn, H.T.;Myung, H.J.
    • Journal of computational fluids engineering
    • /
    • v.17 no.2
    • /
    • pp.35-41
    • /
    • 2012
  • This paper presents incompressible Navier-Stokes solution algorithm for 2D Free-surface flow problems on the Cartesian mesh, which was implemented to run on Graphics Processing Units(GPU). The INS solver utilizes the variable arrangement on the Cartesian mesh, Finite Volume discretization along Constrained Interpolation Profile-Conservative Semi-Lagrangian(CIP-CSL). Solution procedure of incompressible Navier-Stokes equations for free-surface flow takes considerable amount of computation time and memory space even in modern multi-core computing architecture based on Central Processing Units(CPUs). By the recent development of computer architecture technology, Graphics Processing Unit(GPU)'s scientific computing performance outperforms that of CPU's. This paper focus on the utilization of GPU's high performance computing capability, and presents an efficient solution algorithm for free surface flow simulation. The performance of the GPU implementations with double precision accuracy is compared to that of the CPU code using an representative free-surface flow problem, namely. dam-break problem.

Comparison Speed of Pedestrian Detection with Parallel Processing Graphic Processor and General Purpose Processor (병렬처리 그래픽 프로세서와 범용 프로세서에서의 보행자 검출 처리 속도 비교)

  • Park, Jang-Sik
    • The Journal of the Korea institute of electronic communication sciences
    • /
    • v.10 no.2
    • /
    • pp.239-246
    • /
    • 2015
  • Video based object detection is basic technology of implementing smart CCTV system. Various features and algorithms are developed to detect object, however computations of them increase with the performance. In this paper, performances of object detection algorithms with GPU and CPU are compared. Adaboost and SVM algorithm which are widely used to detect pedestrian detection are implemented with CPU and GPU, and speeds of detection processing are compared for the same video. As results of frame rate comparison of Adaboost and SVM algorithm, it is shown that the frame rate with GPU is faster than CPU.

Adaptive Search Range Decision for Accelerating GPU-based Integer-pel Motion Estimation in HEVC Encoders (HEVC 부호화기에서 GPU 기반 정수화소 움직임 추정을 고속화하기 위한 적응적인 탐색영역 결정 방법)

  • Kim, Sangmin;Lee, Dongkyu;Sim, Dong-Gyu;Oh, Seoung-Jun
    • Journal of Broadcast Engineering
    • /
    • v.19 no.5
    • /
    • pp.699-712
    • /
    • 2014
  • In this paper, we propose a new Adaptive Search Range (ASR) decision algorithm for accelerating GPU-based Integer-pel Motion Estimation (IME) of High Efficiency Video Coding (HEVC). For deciding the ASR, we classify a frame into two models using Motion Vector Differences (MVDs) then adaptively decide the search ranges of each model. In order to apply the proposed algorithm to the GPU-based ME process, starting points of the ME are decided using only temporal Motion Vectors (MVs). The CPU decides the ASR as well as the starting points and transfers them to the GPU. Then, the GPU performs the integer-pel ME. The proposed algorithm reduces the total encoding time by 37.9% with BD-rate increase of 1.1% and yields 951.2 times faster ME against the CPU-based anchor. In addition, the proposed algorithm achieves the time reduction of 57.5% in the ME running time with the negligible coding loss of 0.6%, compared with the simple GPU-based ME without ASR decision.

Adaptive Processing Algorithm Allocation on OpenCL-based FPGA-GPU Hybrid Layer for Energy-Efficient Reconfigurable Acceleration of Abnormal ECG Diagnosis (비정상 ECG 진단의 에너지 효율적인 재구성 가능한 가속을 위한 OpenCL 기반 FPGA-GPU 혼합 계층 적응 처리 알고리즘 할당)

  • Lee, Dongkyu;Lee, Seungmin;Park, Daejin
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.25 no.10
    • /
    • pp.1279-1286
    • /
    • 2021
  • The electrocardiogram (ECG) signal is a good indicator for early diagnosis of heart abnormalities. The ECG signal has a different reference normal signal for each person. And it requires lots of data to diagnosis. In this paper, we propose an adaptive OpenCL-based FPGA-GPU hybrid-layer platform to efficiently accelerate ECG signal diagnosis. As a result of diagnosing 19870 number of ECG signals of MIT-BIH arrhythmia database on the platform, the FPGA accelerator takes 1.15s, that the execution time was reduced by 89.94% and the power consumption was reduced by 84.0% compared to the software execution. The GPU accelerator takes 1.87s, that the execution time was reduced by 83.56% and the power consumption was reduced by 62.3% compared to the software execution. Although the proposed FPGA-GPU hybrid platform has a slower diagnostic speed than the FPGA accelerator, it can operate a flexible algorithm according to the situation by using the GPU.

An Efficient Graph Algorithm Processing Scheme using GPUs with Limited Memory (제한된 메모리를 가진 GPU를 이용한 효율적인 그래프 알고리즘 처리 기법)

  • Song, Sang-ho;Lee, Hyeon-byeong;Choi, Do-jin;Lim, Jong-tae;Bok, Kyoung-soo;Yoo, Jae-soo
    • The Journal of the Korea Contents Association
    • /
    • v.22 no.8
    • /
    • pp.81-93
    • /
    • 2022
  • Recently, research on processing a large-capacity graph using GPUs has been conducting. In order to process a large-capacity graph in a GPU with limited memory, the graph must be divided into subgraphs and then processed by scheduling subgraphs. In this paper, we propose an efficient graph algorithm processing scheme in GPU environments with limited memory and performance evaluation. The proposed scheme consists of a graph differential subgraph scheduling method and a graph segmentation method. The bulk graph segmentation method determines how a large-capacity graph can be segmented into subgraphs so that it can be processed efficiently by the GPU. The differential subgraph scheduling method schedule subgraphs processed by GPUs to reduce redundant transmission of the repeatedly used data between HOST-GPUs. It shows the superiority of the proposed scheme by performing various performance evaluations.

CUDA based parallel design of a shot change detection algorithm using frame segmentation and object movement

  • Kim, Seung-Hyun;Lee, Joon-Goo;Hwang, Doo-Sung
    • Journal of the Korea Society of Computer and Information
    • /
    • v.20 no.7
    • /
    • pp.9-16
    • /
    • 2015
  • This paper proposes the parallel design of a shot change detection algorithm using frame segmentation and moving blocks. In the proposed approach, the high parallel processing components, such as frame histogram calculation, block histogram calculation, Otsu threshold setting function, frame moving operation, and block histogram comparison, are designed in parallel for NVIDIA GPU. In order to minimize memory access delay time and guarantee fast computation, the output of a GPU kernel becomes the input data of another kernel in a pipeline way using the shared memory of GPU. In addition, the optimal sizes of CUDA processing blocks and threads are estimated through the prior experiments. In the experimental test of the proposed shot change detection algorithm, the detection rate of the GPU based parallel algorithm is the same as that of the CPU based algorithm, but the average of processing time speeds up about 6~8 times.

Efficient Collaboration Method Between CPU and GPU for Generating All Possible Cases in Combination (조합에서 모든 경우의 수를 만들기 위한 CPU와 GPU의 효율적 협업 방법)

  • Son, Ki-Bong;Son, Min-Young;Kim, Young-Hak
    • KIPS Transactions on Computer and Communication Systems
    • /
    • v.7 no.9
    • /
    • pp.219-226
    • /
    • 2018
  • One of the systematic ways to generate the number of all cases is a combination to construct a combination tree, and its time complexity is O($2^n$). A combination tree is used for various purposes such as the graph homogeneity problem, the initial model for calculating frequent item sets, and so on. However, algorithms that must search the number of all cases of a combination are difficult to use realistically due to high time complexity. Nevertheless, as the amount of data becomes large and various studies are being carried out to utilize the data, the number of cases of searching all cases is increasing. Recently, as the GPU environment becomes popular and can be easily accessed, various attempts have been made to reduce time by parallelizing algorithms having high time complexity in a serial environment. Because the method of generating the number of all cases in combination is sequential and the size of sub-task is biased, it is not suitable for parallel implementation. The efficiency of parallel algorithms can be maximized when all threads have tasks with similar size. In this paper, we propose a method to efficiently collaborate between CPU and GPU to parallelize the problem of finding the number of all cases. In order to evaluate the performance of the proposed algorithm, we analyze the time complexity in the theoretical aspect, and compare the experimental time of the proposed algorithm with other algorithms in CPU and GPU environment. Experimental results show that the proposed CPU and GPU collaboration algorithm maintains a balance between the execution time of the CPU and GPU compared to the previous algorithms, and the execution time is improved remarkable as the number of elements increases.

An Effective Parallel Implementation of Sound Synthesis of Guitar using GPU (GPU를 이용한 기타의 음 합성을 위한 효과적인 병렬 구현)

  • Kang, Sung-Mo;Kim, Jong-Myon
    • Journal of the Korea Society of Computer and Information
    • /
    • v.18 no.8
    • /
    • pp.1-8
    • /
    • 2013
  • This paper proposes an effective parallel implementation of a physical modeling synthesis of guitar on the GPU environment. We used appropriate filter coefficients and adjusted the length of delay line for each open string to generate 44,100 six-polyphonic guitar sounds (E2, A2, D3, G4, B3, E4) by using physical modeling synthesis. In addition, we analyzed the physical modeling synthesis algorithm and observed that we can exploit parallelism inherent in the length of delay line. Thus, we assigned CUDA cores as many as the length of delay line and effectively implemented the physical modeling synthesis using GPU to achieve the highest performance. Experimental results indicated that synthetic guitar sounds using GPU were very similar to the original sounds when we compared their spectra. In addition, GPU achieved 68x and 3x better performance than high-performance TI DSP and CPU, respectively. Furthermore, this paper implemented and evaluated the performance of multi-GPU systems for the physical modeling algorithm.

Parallelization and Performance Optimization of the Boyer-Moore Algorithm on GPU (Boyer-Moore 알고리즘을 위한 GPU상에서의 병렬 최적화)

  • Jeong, Yosang;Tran, Nhat-Phuong;Lee, Myungho;Nam, Dukyun;Kim, Jik-Soo;Hwang, Soonwook
    • KIISE Transactions on Computing Practices
    • /
    • v.21 no.2
    • /
    • pp.138-143
    • /
    • 2015
  • The Boyer-Moore algorithm is a single pattern string matching algorithm that is widely used in various applications such as computer and internet security, and bioinformatics. This algorithm is computationally demanding and requires high-performance parallel processing. In this paper, we propose a parallelization and performance optimization methodology for the BM algorithm on a GPU. Our methodology adopts an algorithmic cascading technique. This results in significant reductions in the mapping overheads for the threads participating in the parallel string matching. It also results in the efficient utilization of the multithreading capability of the GPU which improves the load balancing among threads. Our experimental results show that this approach achieves a 45-times speedup at maximum, in comparison with a serial execution.