• Title/Summary/Keyword: Nvidia CUDA

Search Result 80, Processing Time 0.025 seconds

GP-GPU based Parallelization for Urban Terrain Atmospheric Model CFD_NIMR (도시기상모델 CFD_NIMR의 GP-GPU 실행을 위한 병렬 프로그램의 구현)

  • Kim, Youngtae;Park, Hyeja;Choi, Young-Jeen
    • Journal of Internet Computing and Services
    • /
    • v.15 no.2
    • /
    • pp.41-47
    • /
    • 2014
  • In this paper, we implemented a CUDA Fortran parallel program to run the CFD_NIMR model on GP-GPU's, which simulates air diffusion on urban terrains. A GP-GPU is graphic processing unit in the form of a PCI card, and a general calculation accelerator to perform a large amount of high speed calculations with low cost and electric power. The GP-GPU gives performance enhancement of speed by 15 times to compare the Nvidia Tesla C1060 GPU with Intel XEON 2.0 GHz CPU. In addition, the program on a GP-GPU shows efficient performance compared to an MPI parallel program on multiple CPU's. It is expected that a proposed programming method on the GP-GPU parallel program can be used for numerical models with a similar structure.

Performance Comparison of Join Operations Parallelization by using GPGPU (GPGPU 기반 조인 연산 병렬화 성능 비교)

  • Lee, Jong-Sub;Lee, Sang-Back;Lee, Kyu-Chul
    • Database Research
    • /
    • v.34 no.3
    • /
    • pp.28-44
    • /
    • 2018
  • In a database system, the most expensive operation among relational operations is a join operation. Generally, CPU-based join operations uses parallel processing with either 1 core or 16 cores at most, which does not significantly improve the function. On the other hand, GPGPU(General-Purpose computing on Graphics Processing Units) allows parallel processing through thousands of processing units, greatly reducing the time required to perform join operations. Parallelization of the operation using GPGPU uses NVIDIA's CUDA SDK. In this paper, we implement parallelization of the join operation using GPGPU and compare the performances. The used join operations are Nested Loop Join (NLJ), Sort Merge Join (SMJ) and Hash Join (HJ), and GPGPU equipment uses TITAN Xp, GTX 1080 Ti and GTX 1080. We measure and compare the performance of join operations based on CPU and GPGPU. We compare this performance with the performance of the previous study on the join operation based on GPGPU. The results of experiment show that the performance based on GPGPU is 6~328 times faster than the one based on CPU.

Speed-optimized Implementation of HIGHT Block Cipher Algorithm (HIGHT 블록 암호 알고리즘의 고속화 구현)

  • Baek, Eun-Tae;Lee, Mun-Kyu
    • Journal of the Korea Institute of Information Security & Cryptology
    • /
    • v.22 no.3
    • /
    • pp.495-504
    • /
    • 2012
  • This paper presents various speed optimization techniques for software implementation of the HIGHT block cipher on CPUs and GPUs. We considered 32-bit and 64-bit operating systems for CPU implementations. After we applied the bit-slicing and byte-slicing techniques to HIGHT, the encryption speed recorded 1.48Gbps over the intel core i7 920 CPU with a 64-bit operating system, which is up to 2.4 times faster than the previous implementation. We also implemented HIGHT on an NVIDIA GPU equipped with CUDA, and applied various optimization techniques, such as storing most frequently used data like subkeys and the F lookup table in the shared memory; and using coalesced access when reading data from the global memory. To our knowledge, this is the first result that implements and optimizes HIGHT on a GPU. We verified that the byte-slicing technique guarantees a speed-up of more than 20%, resulting a speed which is 31 times faster than that on a CPU.

Optimizing Skyline Query Processing Algorithms on CUDA Framework (CUDA 프레임워크 상에서 스카이라인 질의처리 알고리즘 최적화)

  • Min, Jun;Han, Hwan-Soo;Lee, Sang-Won
    • Journal of KIISE:Databases
    • /
    • v.37 no.5
    • /
    • pp.275-284
    • /
    • 2010
  • GPUs are stream processors based on multi-cores, which can process large data with a high speed and a large memory bandwidth. Furthermore, GPUs are less expensive than multi-core CPUs. Recently, usage of GPUs in general purpose computing has been wide spread. The CUDA architecture from Nvidia is one of efforts to help developers use GPUs in their application domains. In this paper, we propose techniques to parallelize a skyline algorithm which uses a simple nested loop structure. In order to employ the CUDA programming model, we apply our optimization techniques to make our skyline algorithm fit into the performance restrictions of the CUDA architecture. According to our experimental results, we improve the original skyline algorithm by 80% with our optimization techniques.

Fundamental Function Design of Real-Time Unmanned Monitoring System Applying YOLOv5s on NVIDIA TX2TM AI Edge Computing Platform

  • LEE, SI HYUN
    • International journal of advanced smart convergence
    • /
    • v.11 no.2
    • /
    • pp.22-29
    • /
    • 2022
  • In this paper, for the purpose of designing an real-time unmanned monitoring system, the YOLOv5s (small) object detection model was applied on the NVIDIA TX2TM AI (Artificial Intelligence) edge computing platform in order to design the fundamental function of an unmanned monitoring system that can detect objects in real time. YOLOv5s was applied to the our real-time unmanned monitoring system based on the performance evaluation of object detection algorithms (for example, R-CNN, SSD, RetinaNet, and YOLOv5). In addition, the performance of the four YOLOv5 models (small, medium, large, and xlarge) was compared and evaluated. Furthermore, based on these results, the YOLOv5s model suitable for the design purpose of this paper was ported to the NVIDIA TX2TM AI edge computing system and it was confirmed that it operates normally. The real-time unmanned monitoring system designed as a result of the research can be applied to various application fields such as an security or monitoring system. Future research is to apply NMS (Non-Maximum Suppression) modification, model reconstruction, and parallel processing programming techniques using CUDA (Compute Unified Device Architecture) for the improvement of object detection speed and performance.

Design of Omok AI using Genetic Algorithm and Game Trees and Their Parallel Processing on the GPU (유전 알고리즘과 게임 트리를 병합한 오목 인공지능 설계 및 GPU 기반 병렬 처리 기법)

  • Ahn, Il-Jun;Park, In-Kyu
    • Journal of KIISE:Computer Systems and Theory
    • /
    • v.37 no.2
    • /
    • pp.66-75
    • /
    • 2010
  • This paper proposes an efficient method for design and implementation of the artificial intelligence (AI) of 'omok' game on the GPU. The proposed AI is designed on a cooperative structure using min-max game tree and genetic algorithm. Since the evaluation function needs intensive computation but is independently performed on a lot of candidates in the solution space, it is computed on the GPU in a massive parallel way. The implementation on NVIDIA CUDA and the experimental results show that it outperforms significantly over the CPU, in which parallel game tree and genetic algorithm on the GPU runs more than 400 times and 300 times faster than on the CPU. In the proposed cooperative AI, selective search using genetic algorithm is performed subsequently after the full search using game tree to search the solution space more efficiently as well as to avoid the thread overflow. Experimental results show that the proposed algorithm enhances the AI significantly and makes it run within the time limit given by the game's rule.

The performance of fast view synthesis using GPU (GPU를 이용한 고속 영상 합성 기법의 성능)

  • Kim, Jaehan;Shin, Hong-Chang;Cheong, Won-Sik;Bang, Gun
    • Proceedings of the Korean Society of Broadcast Engineers Conference
    • /
    • 2011.07a
    • /
    • pp.22-24
    • /
    • 2011
  • 본 논문에서는 3차원 디스플레이 시스템에서 다수의 중간 시점 영상을 실시간으로 생성할 수 있도록 GPU 기반의 고속 영상 합성기법을 제안하였으며 그에 대한 성능을 알아본다. 카메라의 기하 정보 및 참조 영상들의 깊이 정보를 이용하여 중간 시점 영상을 생성하였으며, 영상 합성 방법을 GPU에서 병렬 처리함으로써 고속화할 수 있었다. GPU를 효율적으로 다루기 위해 NVIDIA사의 CUDA(Compute Unified Device Architecture)$^TM$를 이용하였다. 제안한 기법은 CUDA의 SIMD(Single Instruction MUltiple Data) 구조를 사용하여 중간 영상 합성을 처리할 수 있도록 설계하였다. 본 논문은 고속 영상 합성에 중점을 두었고, 제안한 고속화 기법의 결과를 분석함으로써 다시점 3차원 디스플레이 시스템의 적용 가능성을 알아본다.

  • PDF

EFFICIENT COMPUTATION OF COMPRESSIBLE FLOW BY HIGHER-ORDER METHOD ACCELERATED USING GPU (고차 정확도 수치기법의 GPU 계산을 통한 효율적인 압축성 유동 해석)

  • Chang, T.K.;Park, J.S.;Kim, C.
    • Journal of computational fluids engineering
    • /
    • v.19 no.3
    • /
    • pp.52-61
    • /
    • 2014
  • The present paper deals with the efficient computation of higher-order CFD methods for compressible flow using graphics processing units (GPU). The higher-order CFD methods, such as discontinuous Galerkin (DG) methods and correction procedure via reconstruction (CPR) methods, can realize arbitrary higher-order accuracy with compact stencil on unstructured mesh. However, they require much more computational costs compared to the widely used finite volume methods (FVM). Graphics processing unit, consisting of hundreds or thousands small cores, is apt to massive parallel computations of compressible flow based on the higher-order CFD methods and can reduce computational time greatly. Higher-order multi-dimensional limiting process (MLP) is applied for the robust control of numerical oscillations around shock discontinuity and implemented efficiently on GPU. The program is written and optimized in CUDA library offered from NVIDIA. The whole algorithms are implemented to guarantee accurate and efficient computations for parallel programming on shared-memory model of GPU. The extensive numerical experiments validates that the GPU successfully accelerates computing compressible flow using higher-order method.

Performance Study of Satellite Image Processing on Graphics Processors Unit Using CUDA

  • Jeong, In-Kyu;Hong, Min-Gee;Hahn, Kwang-Soo;Choi, Joonsoo;Kim, Choen
    • Korean Journal of Remote Sensing
    • /
    • v.28 no.6
    • /
    • pp.683-691
    • /
    • 2012
  • High resolution satellite images are now widely used for a variety of mapping applications including photogrammetry, GIS data acquisition and visualization. As the spectral and spatial data size of satellite images increases, a greater processing power is needed to process the images. The solution of these problems is parallel systems. Parallel processing techniques have been developed for improving the performance of image processing along with the development of the computational power. However, conventional CPU-based parallel computing is often not good enough for the demand for computational speed to process the images. The GPU is a good candidate to achieve this goal. Recently GPUs are used in the field of highly complex processing including many loop operations such as mathematical transforms, ray tracing. In this study we proposed a technique for parallel processing of high resolution satellite images using GPU. We implemented a spectral radiometric processing algorithm on Landsat-7 ETM+ imagery using CUDA, a parallel computing architecture developed by NVIDIA for GPU. Also performance of the algorithm on GPU and CPU is compared.

CUDA based parallel design of a shot change detection algorithm using frame segmentation and object movement

  • Kim, Seung-Hyun;Lee, Joon-Goo;Hwang, Doo-Sung
    • Journal of the Korea Society of Computer and Information
    • /
    • v.20 no.7
    • /
    • pp.9-16
    • /
    • 2015
  • This paper proposes the parallel design of a shot change detection algorithm using frame segmentation and moving blocks. In the proposed approach, the high parallel processing components, such as frame histogram calculation, block histogram calculation, Otsu threshold setting function, frame moving operation, and block histogram comparison, are designed in parallel for NVIDIA GPU. In order to minimize memory access delay time and guarantee fast computation, the output of a GPU kernel becomes the input data of another kernel in a pipeline way using the shared memory of GPU. In addition, the optimal sizes of CUDA processing blocks and threads are estimated through the prior experiments. In the experimental test of the proposed shot change detection algorithm, the detection rate of the GPU based parallel algorithm is the same as that of the CPU based algorithm, but the average of processing time speeds up about 6~8 times.