• Title/Summary/Keyword: GPU algorithm

Search Result 266, Processing Time 0.034 seconds

GPU Accelerating Methods for Pease FFT Processing (Pease FFT 처리를 위한 GPU 가속 기법)

  • Oh, Se-Chang;Joo, Young-Bok;Kwon, Oh-Young;Huh, Kyung-Moo
    • Journal of Institute of Control, Robotics and Systems
    • /
    • v.20 no.1
    • /
    • pp.37-41
    • /
    • 2014
  • FFT (Fast Fourier Transform) has been widely used in various fields such as image processing, voice processing, physics, astronomy, applied mathematics and so forth. Much research has been conducted due to the importance of the FFT and recently new FFT algorithms using a GPU (Graphics Processing Unit) have been developed for the purpose of much faster processing. In this paper, the new optimal FFT algorithm using the Pease FFT algorithm has been proposed reflecting the hardware configuration of a GPGPU (General Purpose computing of GPU). According to the experiments, the proposed algorithm outperformed by between 3% to 43% compared to the CUFFT algorithm.

Analysis of GPU-based Parallel Shifted Sort Algorithm by comparing with General GPU-based Tree Traversal (일반적인 GPU 트리 탐색과의 비교실험을 통한 GPU 기반 병렬 Shifted Sort 알고리즘 분석)

  • Kim, Heesu;Park, Taejung
    • Journal of Digital Contents Society
    • /
    • v.18 no.6
    • /
    • pp.1151-1156
    • /
    • 2017
  • It is common to achieve lower performance in traversing tree data structures in GPU than one expects. In this paper, we analyze the reason of lower-than-expected performance in GPU tree traversal and present that the warp divergences is caused by the branch instructions ("if${\ldots}$ else") which appear commonly in tree traversal CUDA codes. Also, we compare the parallel shifted sort algorithm which can reduce the number of warp divergences with a kd-tree CUDA implementation to show that the shifted sort algorithm can work faster than the kd-tree CUDA implementation thanks to less warp divergences. As the analysis result, the shifted sort algorithm worked about 16-fold faster than the kd-tree CUDA implementation for $2^{23}$ query points and $2^{23}$ data points in $R^3$ space. The performance gaps tend to increase in proportion to the number of query points and data points.

Parallel Algorithm for Matrix-Matrix Multiplication on the GPU (GPU 기반 행렬 곱셈 병렬처리 알고리즘)

  • Park, Sangkun
    • Journal of Institute of Convergence Technology
    • /
    • v.9 no.1
    • /
    • pp.1-6
    • /
    • 2019
  • Matrix multiplication is a fundamental mathematical operation that has numerous applications across most scientific fields. In this paper, we presents a parallel GPU computation algorithm for dense matrix-matrix multiplication using OpenGL compute shader, which can play a very important role as a fundamental building block for many high-performance computing applications. Experimental results on NVIDIA Quad 4000 show that the proposed algorithm runs about 208 times faster than previous CPU algorithm and achieves performance of 75 GFLOPS in single precision for dense matrices with matrix size 4,096. Such performance proves that our algorithm is practical for real applications.

Accelerating Gaussian Hole-Filling Algorithm using GPU (GPU를 이용한 Gaussian Hole-Filling Algorithm 가속)

  • Park, Jun-Ho;Han, Tack-Don
    • Proceedings of the Korean Society of Computer Information Conference
    • /
    • 2012.07a
    • /
    • pp.79-82
    • /
    • 2012
  • 3차원 멀티미디어 서비스에 대한 관심이 높아짐에 따라 관련 연구들이 현재 다양하게 논의되고 있다. Stereoscopy영상을 생성하기 위한 기존의 방법으로는 두 대의 촬영용 카메라를 일정한 간격으로 띄워놓고 피사체를 촬영한 후 해당 좌시점과 우시점을 생성하는 방법을 이용하였다. 하지만 이는 영상 대역폭의 부담을 가져오게 된다. 이를 해결하기 위하여 Depth정보와 한 장의 영상을 이용한 DIBR(Depth Image Based Rendering) Algorithm에 대한 연구가 많이 이루어지고 있다. 그중 Gaussian Depth Map을 이용한 Hole-Filling 방법은 DIBR에서 가장 자연스러운 결과를 보여주지만 다른 DIBR Algorithm들에 비해 속도가 현저히 느리다는 단점이 있다. 본 논문에서는 영상 생성의 고속화를 위해 GPU를 이용한 Gaussian Hole-Filling Algorithm의 병렬처리 구조를 제안하고 이를 이용한 DIBR Algorithm 생성과정을 제시한다.

  • PDF

Intelligent Face Recognition and Tracking System to Distribute GPU Resources using CUDA (쿠다를 사용하여 GPU 리소스를 분배하는 지능형 얼굴 인식 및 트래킹 시스템)

  • Kim, Jae-Heong;Lee, Seung-Ho
    • Journal of IKEEE
    • /
    • v.22 no.2
    • /
    • pp.281-288
    • /
    • 2018
  • In this paper, we propose an intelligent face recognition and tracking system that distributes GPU resources using CUDA. The proposed system consists of five steps such as GPU allocation algorithm that distributes GPU resources in optimal state, face area detection and face recognition using deep learning, real time face tracking, and PTZ camera control. The GPU allocation algorithm that distributes multi-GPU resources optimally distributes the GPU resources flexibly according to the activation level of the GPU, unlike the method of allocating the GPU to the thread fixedly. Thus, there is a feature that enables stable and efficient use of multiple GPUs. In order to evaluate the performance of the proposed system, we compared the proposed system with the non - distributed system. As a result, the system which did not allocate the resource showed unstable operation, but the proposed system showed stable resource utilization because it was operated stably. Thus, the utility of the proposed system has been demonstrated.

Design of Line Scratch Detection and Restoration Algorithm using GPU (GPU를 이용한 선형 스크래치 탐지와 복원 알고리즘의 설계)

  • Lee, Joon-Goo;Shim, She-Yong;You, Byoung-Moon;Hwang, Doo-Sung
    • Journal of the Korea Society of Computer and Information
    • /
    • v.19 no.4
    • /
    • pp.9-16
    • /
    • 2014
  • This paper proposes a linear scratch detection and restoration algorithm using pixel data comparison in a single frame or consecutive frames. There exists a high parallelism in that a scratch detection and restoration algorithm needs a large amount of comparison operations. The proposed scratch detection and restoration algorithm is designed with a GPU for fast computation. We test the proposed algorithm in sequential and parallel processing with the set of digital videos in National Archive of Korea. In the experiments, the scratch detection rate of consecutive frames is as fast as about 20% for that of a single frame. The detection and restoration rates of a GPU-based algorithm are similar to those of a CPU-based algorithm, but the parallel implementation speeds up to about 50 times.

GPU Based Incremental Connected Component Processing in Dynamic Graphs (동적 그래프에서 GPU 기반의 점진적 연결 요소 처리)

  • Kim, Nam-Young;Choi, Do-Jin;Bok, Kyoung-Soo;Yoo, Jae-Soo
    • The Journal of the Korea Contents Association
    • /
    • v.22 no.6
    • /
    • pp.56-68
    • /
    • 2022
  • Recently, as the demand for real-time processing increases, studies on a dynamic graph that changes over time has been actively done. There is a connected components processing algorithm as one of the algorithms for analyzing dynamic graphs. GPUs are suitable for large-scale graph calculations due to their high memory bandwidth and computational performance. However, when computing the connected components of a dynamic graph using the GPU, frequent data exchange occurs between the CPU and the GPU during real graph processing due to the limited memory of the GPU. The proposed scheme utilizes the Weighted-Quick-Union algorithm to process large-scale graphs on the GPU. It supports fast connected components computation by applying the size to the connected component label. It computes the connected component by determining the parts to be recalculated and minimizing the data to be transmitted to the GPU. In addition, we propose a processing structure in which the GPU and the CPU execute asynchronously to reduce the data transfer time between GPU and CPU. We show the excellence of the proposed scheme through performance evaluation using real dataset.

A STUDY OF THE APPLICATION OF DELAUNAY GRID GENERATION ON GPU USING CUDA LIBRARY (GPU Library CUDA를 이용한 효율적인 Delaunay 격자 생성에 관한 연구)

  • Song, J.H.;Kang, S.H.;Kim, G.M.;Kim, B.S.
    • 한국전산유체공학회:학술대회논문집
    • /
    • 2011.05a
    • /
    • pp.194-198
    • /
    • 2011
  • In this study, an efficient algorithm for Delaunay triangulation of a number of points which can be used on a GPU-based parallel computation is studied The developed algorithm is programmed using CUDA library. and the program takes full advantage of parallel computation which are concurrently performed on each of the threads on GPU. The results of partitioned triangulation collected from the GPU computation requires proper stitching between neighboring partitions and calculation of connectivities among triangular cells on CPU In this study, the effect of number of threads on the efficiency and total duration for Delaunay grid generation is studied. And it is also shown that GPU computing using CUDA for Delaunay grid generation is feasible and it saves total time required for the triangulation of the large number points compared to the sequential CPU-based triangulation programs.

  • PDF

Visibility based N-Body GPU Collision Detection (가시화 기반 N-body GPU 충돌 체크 방법)

  • Sung, Mankyu
    • Proceedings of the Korean Institute of Information and Commucation Sciences Conference
    • /
    • 2022.10a
    • /
    • pp.400-403
    • /
    • 2022
  • This paper propose a GPU-based N-body collision detection algorithm using LBVH (Linear Bounding Volume Hierarchy) technique. This algorithm introduces a new modified Morton code scheme where the codes use an information about how much each body takes a space in the screen space. This scheme improves the GPU sorting performance of the N-Body because it culls out invisible objects in natural manner. Through the experiments, we verifies that the proposed algorithms can have at least 15% performance improvement over the existing methods

  • PDF

GPU-Based Acceleration of Quantum-Inspired Evolutionary Algorithm (GPU를 이용한 Quantum-Inspired Evolutionary Algorithm 가속)

  • Ryoo, Ji-Hyun;Park, Han-Min;Choi, Ki-Young
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.49 no.8
    • /
    • pp.1-9
    • /
    • 2012
  • Quantum-Inspired Evolutionary Algorithm(QEA) contains sufficient data-level parallelism to be naturally accelerated on GPUs. For an efficient reduction of execution time, however, careful task-mapping should be done to properly reflect the characteristics of CPU and GPU. Furthermore, when deciding which part of the application should run on GPU, we need to consider the data transfer between CPU and GPU memory spaces as well as the data-level parallelism. In addition, the usage of zero-copy host memory, proper choice of the execution configuration, and thread organization considering memory coalescing is important to further reduce the execution time. With all these techniques, we could run QEA 3.69 times faster on average in comparison with the multi-threading CPU for the case of 0-1 knapsack problem with 30,000 items.