• Title/Summary/Keyword: GPU algorithm

High-Performance Multi-GPU Rendering Based on Implicit Synchronization (묵시적 동기화 기반의 고성능 다중 GPU 렌더링)

  • Kim, Younguk;Lee, Sungkil
    • Journal of KIISE / v.42 no.11 / pp.1332-1338 / 2015
  • Multi-GPU rendering has recently drawn growing attention as a way to support real-time, high-quality rendering at high resolution. Attaining high performance in real-time multi-GPU rendering requires great care to reduce the overhead of data transfer among GPUs and of frame composition. This paper presents a novel multi-GPU algorithm that greatly enhances split frame rendering with implicit query-based synchronization. To support implicit synchronization in frame composition, we further present a message queue-based scheduling algorithm. In our experiments, the algorithm improved rendering performance by up to 200% over existing algorithms.

GPU-Based Parallel Collision Detection for Deformable Objects (변형 물체를 위한 GPU 기반 병렬 충돌 감지)

  • Sung, Nak-Jun;Kim, Min Sang;Hong, Min;Choi, Yoo-Joo
    • KIPS Transactions on Software and Data Engineering / v.7 no.1 / pp.25-32 / 2018
  • Because of its heavy computational cost, deformable object simulation requires a more efficient collision detection method than rigid body simulation. However, a CPU-based collision detection algorithm applied unchanged to a GPU environment cannot exploit the GPU's performance, so a collision detection algorithm and data structure optimized for the GPU are essential. In this paper, we therefore propose a GPU-based parallel collision detection algorithm for the mass-spring system, which is widely used to represent deformable objects. The proposed method uses a parallel algorithm and data structure to reduce collision detection cost through a GPU-based culling algorithm built on an AABB-Octree structure. We demonstrate the effectiveness of the proposed method by comparing it with an intersection test of all triangle pairs performed in parallel. Experimental results show that the proposed method improves performance by about 24% on average. The proposed method is therefore expected to improve the performance of real-time simulation of deformable objects.
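The culling idea in this abstract can be illustrated with a minimal sketch: before running an expensive triangle-triangle intersection test, compare axis-aligned bounding boxes (AABBs) and discard pairs whose boxes do not overlap. This is not the authors' implementation. The octree and the GPU parallelism are omitted, and the function names (`aabb`, `aabb_overlap`, `candidate_pairs`) and the sample triangles are hypothetical.

```python
from itertools import combinations

def aabb(tri):
    """Return (min_corner, max_corner) of a triangle given as three (x, y, z) points."""
    xs, ys, zs = zip(*tri)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))

def aabb_overlap(a, b):
    """Two boxes overlap only if their extents overlap on every axis."""
    (amin, amax), (bmin, bmax) = a, b
    return all(amin[i] <= bmax[i] and bmin[i] <= amax[i] for i in range(3))

def candidate_pairs(triangles):
    """Brute-force broad phase: keep index pairs whose boxes overlap.

    An octree would avoid enumerating all pairs; that part is not shown here.
    """
    boxes = [aabb(t) for t in triangles]
    return [(i, j) for i, j in combinations(range(len(boxes)), 2)
            if aabb_overlap(boxes[i], boxes[j])]

# Hypothetical sample data: triangles 0 and 1 are close, triangle 2 is far away.
triangles = [
    ((0, 0, 0), (1, 0, 0), (0, 1, 0)),
    ((0.5, 0.5, -1), (0.5, 0.5, 1), (1, 1, 0)),
    ((5, 5, 5), (6, 5, 5), (5, 6, 5)),
]
pairs = candidate_pairs(triangles)
```

Only the surviving pairs would be passed to an exact triangle-triangle test, which is where the cost reduction comes from.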

Large-scale 3D fast Fourier transform computation on a GPU

  • Jaehong Lee;Duksu Kim
    • ETRI Journal / v.45 no.6 / pp.1035-1045 / 2023
  • We propose a novel graphics processing unit (GPU) algorithm that can handle a large-scale 3D fast Fourier transform (i.e., 3D-FFT) problem whose data size is larger than the GPU's memory. A 1D FFT-based 3D-FFT computational approach is used to work around the limited device memory. Moreover, to reduce the communication overhead between the CPU and GPU, we propose a 3D data-transposition method that converts the target 1D vector into a contiguous memory layout and improves data transfer efficiency. The transposed data are transferred between the host and device memories efficiently through a pinned buffer and multiple streams. We apply our method to various large-scale benchmarks and compare its performance with the state-of-the-art multicore CPU FFT library (i.e., Fastest Fourier Transform in the West [FFTW]) and a prior GPU-based 3D-FFT algorithm. Our method achieves higher performance (up to 2.89 times) than FFTW, and the performance gap widens as the data size increases. The performance of the prior GPU algorithm decreases considerably on massive-scale problems, whereas our method's performance remains stable.
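The 1D FFT-based approach relies on the separability of the multidimensional transform: a 3D transform can be computed by applying 1D transforms along each axis in turn. The sketch below checks that property on a tiny array. It is an illustration only, not the paper's method: a naive O(n^2) DFT stands in for an optimized 1D FFT, everything runs on the CPU, and the names (`dft1d`, `fft3d_by_axes`, `dft3d_direct`) are hypothetical.

```python
import cmath

def dft1d(v):
    """Naive O(n^2) 1D DFT, standing in for an optimized 1D FFT routine."""
    n = len(v)
    return [sum(v[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def fft3d_by_axes(a):
    """Transform a nested list a[x][y][z] one axis at a time."""
    nx, ny, nz = len(a), len(a[0]), len(a[0][0])
    out = [[[complex(a[x][y][z]) for z in range(nz)] for y in range(ny)]
           for x in range(nx)]
    # z axis: each 1D vector is already contiguous in this layout.
    for x in range(nx):
        for y in range(ny):
            out[x][y] = dft1d(out[x][y])
    # y axis: elements are gathered with a stride.
    for x in range(nx):
        for z in range(nz):
            col = dft1d([out[x][y][z] for y in range(ny)])
            for y in range(ny):
                out[x][y][z] = col[y]
    # x axis: elements are gathered with the largest stride.
    for y in range(ny):
        for z in range(nz):
            col = dft1d([out[x][y][z] for x in range(nx)])
            for x in range(nx):
                out[x][y][z] = col[x]
    return out

def dft3d_direct(a):
    """Direct triple-sum 3D DFT, used only to check the axis-wise result."""
    nx, ny, nz = len(a), len(a[0]), len(a[0][0])
    def term(u, v, w):
        return sum(a[x][y][z]
                   * cmath.exp(-2j * cmath.pi * (u * x / nx + v * y / ny + w * z / nz))
                   for x in range(nx) for y in range(ny) for z in range(nz))
    return [[[term(u, v, w) for w in range(nz)] for v in range(ny)] for u in range(nx)]

# Hypothetical 2 x 3 x 2 test volume.
data = [[[x + 2 * y + 3 * z for z in range(2)] for y in range(3)] for x in range(2)]
by_axes = fft3d_by_axes(data)
direct = dft3d_direct(data)
max_err = max(abs(by_axes[u][v][w] - direct[u][v][w])
              for u in range(2) for v in range(3) for w in range(2))
```

The strided gathers in the y and x passes are the kind of non-contiguous access that a transposition into a contiguous layout is meant to remove before data is transferred.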

The GPU-based Parallel Processing Algorithm for Fast Inspection of Semiconductor Wafers (반도체 웨이퍼 고속 검사를 위한 GPU 기반 병렬처리 알고리즘)

  • Park, Youngdae;Kim, Joon Seek;Joo, Hyonam
    • Journal of Institute of Control, Robotics and Systems / v.19 no.12 / pp.1072-1080 / 2013
  • Many vision inspection techniques are used in industrial production today. In the semiconductor industry in particular, the vision inspection system for wafers is very important, and inspection techniques for semiconductor wafer production must be both highly precise and fast. To achieve these objectives, parallel processing of the inspection algorithm is essential. In this paper, we propose a GPU (Graphics Processing Unit)-based parallel processing algorithm for the fast inspection of semiconductor wafers. The proposed algorithm is implemented on GPU boards made by NVIDIA. The defect detection performance of the proposed algorithm on the GPU is the same as that obtained on a single CPU, but its execution is about 210 times faster than on a single CPU.

A Parallel Processing of Finding Neighbor Agents in Flocking Behaviors Using GPU (GPU를 이용한 무리 짓기에서 이웃 에이전트 찾기의 병렬 처리)

  • Lee, Jae-Moon
    • Journal of Korea Game Society / v.10 no.5 / pp.95-102 / 2010
  • This paper proposes a parallel algorithm for flocking behaviors using a GPU. We used CUDA as the GPU parallel processing architecture and analyzed its characteristics and constraints. Based on this analysis, we improved performance by parallelizing the search for an agent's neighbors, which is the most expensive step in flocking behaviors. We implemented the proposed algorithm on a GTX 285 GPU and experimentally compared its performance with the original spatial partitioning method. The proposed algorithm outperformed the original method by up to 9 times in execution time.
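For context, a minimal sequential sketch of neighbor finding with spatial partitioning is shown below: agents are binned into a uniform grid, and a query inspects only the 3 x 3 block of cells around the agent. This illustrates the general baseline technique, not the paper's CUDA algorithm. The 2D setting, the assumption that the cell size is at least the query radius, and the names (`build_grid`, `neighbors`, `neighbors_brute`) are all assumptions made for illustration.

```python
from collections import defaultdict

def build_grid(positions, cell):
    """Bin agent indices by the grid cell that contains them."""
    grid = defaultdict(list)
    for idx, (x, y) in enumerate(positions):
        grid[(int(x // cell), int(y // cell))].append(idx)
    return grid

def neighbors(i, positions, grid, radius, cell):
    """Indices within `radius` of agent i.

    Assumes cell >= radius, so the surrounding 3 x 3 cells are sufficient.
    """
    x, y = positions[i]
    cx, cy = int(x // cell), int(y // cell)
    found = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for j in grid.get((cx + dx, cy + dy), ()):
                px, py = positions[j]
                if j != i and (px - x) ** 2 + (py - y) ** 2 <= radius ** 2:
                    found.append(j)
    return sorted(found)

def neighbors_brute(i, positions, radius):
    """All-pairs reference used to check the grid query."""
    x, y = positions[i]
    return [j for j, (px, py) in enumerate(positions)
            if j != i and (px - x) ** 2 + (py - y) ** 2 <= radius ** 2]

# Hypothetical agent positions.
positions = [(0.5, 0.5), (1.2, 0.4), (3.9, 3.9), (0.9, 1.4), (1.6, 0.5)]
radius = cell = 1.0
grid = build_grid(positions, cell)
near_agent_0 = neighbors(0, positions, grid, radius, cell)
```

Each agent's query reads the grid without modifying it, so the queries are independent of one another, which is what makes this step a natural candidate for one-thread-per-agent parallelization.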

GPU Acceleration of Range Doppler Algorithm for Real-Time SAR Image Generation (실시간 SAR 영상 생성을 위한 Range Doppler Algorithm의 GPU 가속)

  • Dong-Min Jeong;Woo-Kyung Lee;Myeong-Jin Lee;Yun-Ho Jung
    • Journal of IKEEE / v.27 no.3 / pp.265-272 / 2023
  • In this paper, a GPU-accelerated kernel of the range Doppler algorithm (RDA) was developed for real-time image formation based on frequency modulated continuous wave (FMCW) synthetic aperture radar (SAR). Pinned memory was used to minimize the data transfer time between the host and the GPU device, and the kernel was configured to perform all RDA operations on the GPU to minimize the number of data transfers. The dataset was obtained from an FMCW drone SAR experiment, and the GPU acceleration was measured in an environment with an Intel i7-9700K CPU, 32 GB of RAM, and an Nvidia RTX 3090 GPU. Including the data transfer time between host and device, the speed-up over the CPU was up to 3.41 times; measuring only the computation, without the data transfer time, the speed-up was up to 156 times.
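The gap between the two reported figures can be made concrete with a short back-of-the-envelope calculation. It assumes, which the abstract does not state, that both "up to" figures refer to the same input and configuration, and it lumps everything that is not GPU computation into one "transfer and overhead" term. The variable names are illustrative.

```python
# Reported figures from the abstract.
end_to_end_speedup = 3.41   # data transfer time included
compute_speedup = 156.0     # data transfer time excluded

# Normalize the CPU run time to 1 (arbitrary units).
cpu_time = 1.0
gpu_total = cpu_time / end_to_end_speedup   # GPU time including transfers
gpu_compute = cpu_time / compute_speedup    # GPU time for computation only

# Share of the GPU end-to-end time not spent computing,
# under the same-configuration assumption stated above.
transfer_share = (gpu_total - gpu_compute) / gpu_total
```

Under that assumption, roughly 98% of the GPU end-to-end time would be spent outside computation, which is consistent with the abstract's emphasis on pinned memory and on minimizing the number of transfers.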

GPU-based Stereo Matching Algorithm with the Strategy of Population-based Incremental Learning

  • Nie, Dong-Hu;Han, Kyu-Phil;Lee, Heng-Suk
    • Journal of Information Processing Systems / v.5 no.2 / pp.105-116 / 2009
  • To solve the general problems surrounding the application of genetic algorithms in stereo matching, two measures are proposed. Firstly, a simplified population-based incremental learning (PBIL) strategy is adopted to reduce memory consumption and search inefficiency, and a scheme for controlling the distance of neighbors for disparity smoothness is added to obtain wide-area consistency of disparities. In addition, an alternative version of the proposed algorithm, without a probability vector, is presented for simpler set-ups. Secondly, programmable graphics hardware (GPU) consists of multiple multiprocessors and offers powerful parallelism at low cost. Therefore, to decrease the running time further, a model of the proposed algorithm that can run on programmable graphics hardware (GPU) is presented for the first time. The algorithms are implemented on the CPU as well as on the GPU and are evaluated experimentally. The experimental results show that the proposed algorithm offers better performance than traditional BMA methods with a deliberate relaxation, and than its modified version, in terms of both running speed and stability. A comparison of computation times on the GPU and the CPU shows that the GPU's speed-up over the CPU grows as the image size increases.
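For readers unfamiliar with PBIL, the sketch below shows the textbook form of the technique: instead of storing and recombining a population, it keeps one probability vector, samples candidates from it, and shifts the vector toward the best sample. This is generic PBIL on a toy bit-counting fitness, not the paper's simplified variant or its stereo-matching formulation. The function name `pbil`, its parameters, and the toy fitness are assumptions made for illustration.

```python
import random

def pbil(fitness, n_bits, pop_size=20, lr=0.1, generations=50, seed=0):
    """Generic PBIL: evolve a probability vector rather than a stored population."""
    rng = random.Random(seed)
    p = [0.5] * n_bits                      # start with no preference per bit
    best, best_fit = None, float("-inf")
    for _ in range(generations):
        # Sample a temporary population from the current probability vector.
        pop = [[1 if rng.random() < pi else 0 for pi in p] for _ in range(pop_size)]
        leader = max(pop, key=fitness)
        if fitness(leader) > best_fit:
            best, best_fit = leader, fitness(leader)
        # Move each probability toward the corresponding bit of the leader.
        p = [(1 - lr) * pi + lr * bit for pi, bit in zip(p, leader)]
    return best, best_fit, p

# Toy fitness: number of ones in the bit string (hypothetical stand-in
# for a real matching cost).
best, best_fit, p = pbil(sum, n_bits=16)
```

Because only the probability vector persists between generations, memory use is independent of the population size, which matches the abstract's motivation of reducing memory consumption.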

GPU Algorithm for Outer Boundaries of a Triangle Set (GPU를 이용한 삼각형 집합의 외경계 계산 알고리즘)

  • Kyung, Min-Ho
    • Korean Journal of Computational Design and Engineering / v.17 no.4 / pp.262-273 / 2012
  • We present a novel GPU algorithm to compute the outer cell boundaries of a 3D arrangement subdivided by a given set of triangles. An outer cell boundary is defined as a 2-manifold surface consisting of subdivided polygons facing outward. Many geometric problems, such as Minkowski sums, swept volumes, lower/upper envelopes, and Boolean operations, can be reduced to finding outer cell boundaries with specific properties. Computing outer cell boundaries, however, is very time-consuming and susceptible to numerical errors. To address these problems, we develop a GPU-based algorithm with a robust scheme that combines interval arithmetic and multi-level precision. The proposed algorithm is tested on Minkowski sums of several polygonal models and shows a 5-20 times speed-up over an existing algorithm running on the CPU.
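The general pattern of combining interval arithmetic with multiple precision levels can be sketched as follows: evaluate a geometric predicate in fast floating-point intervals, and fall back to exact arithmetic only when the interval straddles zero. The example uses a 2D orientation test as a simplified stand-in. It is not the paper's 3D scheme, and the function names are hypothetical. It assumes Python 3.9 or later for `math.nextafter` and ignores overflow and underflow corner cases.

```python
import math
from fractions import Fraction

def widen(lo, hi):
    """Round an interval outward by one ulp to cover floating-point rounding error."""
    return math.nextafter(lo, -math.inf), math.nextafter(hi, math.inf)

def isub(a, b):
    """Interval subtraction a - b."""
    return widen(a[0] - b[1], a[1] - b[0])

def imul(a, b):
    """Interval multiplication a * b."""
    prods = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return widen(min(prods), max(prods))

def orient_sign(p, q, r):
    """Sign of the 2D determinant (q - p) x (r - p), plus which level decided it."""
    P, Q, R = ([(c, c) for c in pt] for pt in (p, q, r))
    det = isub(imul(isub(Q[0], P[0]), isub(R[1], P[1])),
               imul(isub(Q[1], P[1]), isub(R[0], P[0])))
    if det[0] > 0:
        return 1, "interval"
    if det[1] < 0:
        return -1, "interval"
    # The interval contains zero: re-evaluate with exact rational arithmetic.
    fp, fq, fr = ([Fraction(c) for c in pt] for pt in (p, q, r))
    exact = (fq[0] - fp[0]) * (fr[1] - fp[1]) - (fq[1] - fp[1]) * (fr[0] - fp[0])
    return (exact > 0) - (exact < 0), "exact"

ccw = orient_sign((0, 0), (1, 0), (0, 1))        # clearly counter-clockwise
cw = orient_sign((0, 0), (0, 1), (1, 0))         # clearly clockwise
collinear = orient_sign((0, 0), (1, 1), (2, 2))  # degenerate, needs the exact level
```

Most inputs are decided at the cheap interval level, and only near-degenerate ones pay for the higher-precision evaluation.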

Matrix Addition & Scalar Multiplication on the GPU (GPU 기반 행렬 덧셈 및 스칼라 곱셈 알고리즘)

  • Park, Sangkun
    • Journal of Institute of Convergence Technology / v.8 no.1 / pp.15-20 / 2018
  • GPUs have recently become programmable enough to perform general-purpose computation quickly by running thousands of threads concurrently. This paper presents a parallel GPU algorithm for dense matrix-matrix addition and scalar multiplication using OpenGL compute shaders. Such an algorithm can serve as a fundamental building block for many high-performance computing applications. Experimental results on an NVIDIA Quadro 4000 show that the proposed algorithm runs 21 times faster than a CPU algorithm and achieves 16 GFLOPS in single precision for dense matrices of size 4,096. This performance indicates that the algorithm is practical for real applications.
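Matrix addition and scalar multiplication are element-wise, so each output element can be computed independently. The sketch below mimics that mapping on the CPU: a function does the work of one hypothetical invocation identified by a flat index, and a plain loop stands in for the parallel dispatch. It is not the paper's compute shader, and the names (`element_kernel`, `run`) are illustrative.

```python
def element_kernel(i, n_cols, A, B, s, out_add, out_scaled):
    """Work of one hypothetical invocation with flat index i."""
    r, c = divmod(i, n_cols)           # recover (row, column) from the flat index
    out_add[r][c] = A[r][c] + B[r][c]  # matrix addition
    out_scaled[r][c] = s * A[r][c]     # scalar multiplication

def run(A, B, s):
    """Sequential stand-in for dispatching one invocation per element."""
    n_rows, n_cols = len(A), len(A[0])
    out_add = [[0] * n_cols for _ in range(n_rows)]
    out_scaled = [[0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows * n_cols):
        element_kernel(i, n_cols, A, B, s, out_add, out_scaled)
    return out_add, out_scaled

# Hypothetical 2 x 3 inputs.
A = [[1, 2, 3], [4, 5, 6]]
B = [[10, 20, 30], [40, 50, 60]]
added, scaled = run(A, B, 2)
```

No invocation reads another invocation's output, so the loop iterations can run in any order or concurrently.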

Accurate and efficient GPU ray-casting algorithm for volume rendering of unstructured grid data

  • Gu, Gibeom;Kim, Duksu
    • ETRI Journal / v.42 no.4 / pp.608-618 / 2020
  • We present a novel GPU-based ray-casting algorithm for volume rendering of unstructured grid data. Our volume rendering system uses a ray-casting method that guarantees accurate rendering results. We also employ the per-pixel intersection list concept in the Bunyk algorithm to guarantee an accurate result for non-convex meshes. For efficient memory access for the lists on the GPU, we represent the intersection lists for all faces as an array with our novel construction algorithm. With the intersection lists, we perform ray-casting on a GPU, and a GPU thread handles each ray. To increase ray-coherency in a thread block and improve memory access efficiency, we extend a prior image-tile-based work distribution method to fit modern GPU architectures. We also show that a prior approach using a per-thread local buffer to reduce redundant computation is not appropriate for modern GPU architectures. Instead, we take an on-demand calculation strategy that achieves better performance even though it allows duplicate computations. We applied our method to three unstructured grid datasets with different characteristics. With a GPU, our method achieved up to 36.5 times higher performance for the ray-casting process and 19.7 times higher performance for the whole volume rendering process compared with the Bunyk algorithm using a CPU core. Also, our approach showed up to 8.2 times higher performance than a GPU-based cell projection method while generating more accurate rendering results. These results demonstrate the efficiency and accuracy of our method.