• Title/Summary/Keyword: Cuda

A Study on a Declines in Performance by Memory Copy in CUDA (CUDA의 메모리 복사로 인한 성능 저하 연구)

  • Kang, Jihun;Lee, DaeWon;Kang, InSung;Yu, HeonChang
    • Proceedings of the Korea Information Processing Society Conference / 2013.11a / pp.135-138 / 2013
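  • CUDA (Compute Unified Device Architecture), a GPGPU (General Purpose Graphics Processing Unit) parallel processing system, has been widely used for high-speed computation. Using CUDA for computation requires an understanding of its characteristics. CUDA distinguishes a Host region, handled by the CPU (Central Processing Unit), from a Device region, handled by the GPU (Graphics Processing Unit), and computation proceeds by copying data between the two regions. Because of this structure, input data must be transferred from main memory to GPU memory before the GPU can process it. As a result, the time spent copying data between main memory and GPU memory is incurred separately from the computation time, and this additional memory copy time is overhead. In this paper, we use experiments to discuss how memory copy time, the number of repetitions of a computation, and the complexity of the computation affect overall performance.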
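The trade-off described in this abstract can be illustrated with a very simple cost model. The sketch below is not taken from the paper: the additive model, the function name `copy_fraction`, and all timing values are assumptions chosen only to show why more repetitions or heavier kernels make the copy overhead less significant.

```python
def copy_fraction(t_copy, t_kernel, iterations):
    """Fraction of total run time spent on host<->device copies.

    Assumed model (not from the paper): data is copied once for a total
    of t_copy, and the kernel then runs `iterations` times, each taking
    t_kernel, with no overlap between copying and computing.
    """
    total = t_copy + iterations * t_kernel
    return t_copy / total


# Hypothetical timings in milliseconds, for illustration only.
for n in (1, 8, 64):
    share = copy_fraction(t_copy=2.0, t_kernel=1.0, iterations=n)
    print(f"{n:3d} iterations: copies take {share:.1%} of the run time")
```

Under this assumed model, the copy share shrinks as the number of repetitions or the per-kernel cost grows, which is the qualitative behavior the abstract sets out to examine experimentally.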

Parallel Intersection Detection Algorithm using CUDA (CUDA 를 이용한 가상 객체들간의 병렬 충돌 검사 알고리즘)

  • Lee, Yeon-Hee;Kim, Young-J.
    • 한국HCI학회:학술대회논문집 / 2008.02a / pp.451-455 / 2008
  • In this paper, we present how we implement the low-level triangle intersection routine in a massively parallel fashion using NVIDIA's new GPGPU language, CUDA. Triangle intersection often becomes a computational bottleneck in collision detection. Because of the relatively low bandwidth between CPU and GPU, it has been challenging to implement efficient object-space collision detection between triangle sets. Thanks to the improved data transmission rates of the CUDA architecture, our triangle intersection implementation performs substantially better than its optimized CPU counterpart.

Efficient Parallel CUDA Random Number Generator on NVIDIA GPUs (NVIDIA GPU 상에서의 난수 생성을 위한 CUDA 병렬프로그램)

  • Kim, Youngtae;Hwang, Gyuhyeon
    • Journal of KIISE / v.42 no.12 / pp.1467-1473 / 2015
  • In this paper, we implement a parallel random number generation program using an LCG (Linear Congruential Generator) on GPUs, which are known for high-performance computing. Random numbers are important in every field that relies on randomness, and the LCG is one of the most widely used methods for generating pseudo-random numbers. We describe the parallel program, which uses the NVIDIA CUDA model and MPI (Message Passing Interface), and present its uniformity and performance results. We also use a Monte Carlo algorithm to calculate pi ($\pi$), comparing our parallel random number generator with cuRAND, a CUDA library, and show that our program is much more efficient. Finally, we compare the performance results on multiple GPUs with the ideal speedups.
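The abstract does not say how the LCG sequence is divided among threads. One common option is leapfrog partitioning, sketched below in plain Python as an illustration, not as the authors' method. The constants `A`, `C`, `M` and all function names are assumptions made for the example; each pass of the outer loop in `leapfrog` stands in for one GPU thread.

```python
# Illustrative LCG parameters (assumed, not taken from the paper).
A, C, M = 1103515245, 12345, 2**31


def lcg_next(x):
    """One step of the base generator: x -> (A*x + C) mod M."""
    return (A * x + C) % M


def jump_params(p):
    """Return (a_p, c_p) so that x -> (a_p*x + c_p) mod M equals p base steps."""
    a_p, c_p = 1, 0
    for _ in range(p):
        a_p, c_p = (A * a_p) % M, (A * c_p + C) % M
    return a_p, c_p


def sequential(seed, n):
    """Reference: the first n values produced one after another."""
    out, x = [], seed
    for _ in range(n):
        x = lcg_next(x)
        out.append(x)
    return out


def leapfrog(seed, n, p):
    """Produce the same n values with p independent streams.

    Stream k owns positions k, k+p, k+2p, ... and advances p base steps
    at a time, so the streams never need to communicate.
    """
    a_p, c_p = jump_params(p)
    starts, x = [], seed
    for _ in range(p):          # first value of each stream
        x = lcg_next(x)
        starts.append(x)
    out = [0] * n
    for k in range(p):          # each k plays the role of one thread
        x = starts[k]
        for i in range(k, n, p):
            out[i] = x
            x = (a_p * x + c_p) % M
    return out
```

Because every stream applies the same jump transformation to its own state, the combined output is identical to the sequential sequence, which makes the scheme straightforward to validate against a single-threaded reference.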

Improving 3D Measurement Speed using CUDA (CUDA를 이용한 3D 측정 속도 향상)

  • Kim, Ho-Joong;Cho, Tai-Hoon
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2017.05a / pp.331-334 / 2017
  • Recently, methods based on fringe patterns have been widely used for 3D measurement. Such a method measures an object using the phase value obtained by projecting a pattern onto it. It requires many operations, such as calculating the phase value and the height, and the amount of computation makes it time-consuming. In this paper, we present a method that uses NVIDIA's CUDA to reduce this time, and we describe how the phase value and height are calculated. We also report the time difference between the CPU version and the CUDA version. The method is effective because it performs the same operations in a shorter time.

Parallel Implementation of SPECK, SIMON and SIMECK by Using NVIDIA CUDA PTX (NVIDIA CUDA PTX를 활용한 SPECK, SIMON, SIMECK 병렬 구현)

  • Jang, Kyung-bae;Kim, Hyun-jun;Lim, Se-jin;Seo, Hwa-jeong
    • Journal of the Korea Institute of Information Security & Cryptology / v.31 no.3 / pp.423-431 / 2021
  • SPECK and SIMON are lightweight block ciphers developed by the NSA (National Security Agency), and SIMECK is a newer lightweight block cipher that combines the advantages of SPECK and SIMON. In this paper, large-volume encryption with SPECK, SIMON, and SIMECK is implemented on a GPU, which offers efficient parallel processing. The CUDA library provided by NVIDIA was used, and performance was maximized by using PTX, the CUDA assembly language, to eliminate unnecessary operations. Compared with a simple CPU implementation, the GPU implementation performed large-scale encryption faster. In addition, comparing the GPU implementations written in C and in PTX confirmed that PTX improved performance further.

A Comparison among Methods using CUDA Programming (CUDA 프로그래밍 기법 비교 연구)

  • Ihm, Sun-Young;Park, Young-Ho
    • Proceedings of the Korea Information Processing Society Conference / 2013.05a / pp.138-139 / 2013
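  • As interest in parallel programming on GPUs grows, research on the topic is being actively pursued. With GPU performance increasing, NVIDIA provides the CUDA programming development environment as a way to use GPUs for general-purpose computation. This paper introduces CUDA programming techniques and uses a simple example to compare the CPU-based and GPU-based approaches.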

An efficient acceleration algorithm of GPU ray tracing using CUDA (CUDA를 이용한 효과적인 GPU 광선추적 가속 알고리즘)

  • Ji, Joong-Hyun;Yun, Dong-Ho;Ko, Kwang-Hee
    • 한국HCI학회:학술대회논문집 / 2009.02a / pp.469-474 / 2009
  • This paper proposes a real-time ray tracing system that uses an optimized kd-tree traversal and a ray/triangle intersection algorithm. Previous kd-tree traversal algorithms search for upper nodes in a bottom-up manner. In this approach, after failing to find an intersected primitive in a leaf node, the traversal must either revisit an already visited parent node or use redundant memory, which makes ray tracing of relatively complex scenes more difficult. The new algorithm uses stacks implemented in the GPU's local memory under the CUDA framework, which eliminates the problems of the previous algorithms. After traversing a node, we perform the Plücker coordinate test, a recent CPU-based ray/triangle intersection algorithm, which CUDA further accelerates through massive parallelism. The Plücker test can drastically reduce computational cost because it does not use barycentric coordinates, only a simple test based on the relations between a ray and the triangle edges. The entire system consists of a single ray kernel and is implemented without introducing complicated synchronization or ray packets. Our experiments show that the new algorithm is roughly twice as fast as the previous one.
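As a rough illustration of the edge-relation idea behind the Plücker coordinate test, the sketch below checks whether the line carrying a ray passes through a triangle by comparing the signs of three side products. It is a plain Python sketch under assumptions, not the paper's kernel: the helper names are hypothetical, rays that only graze an edge or lie in the triangle's plane are reported as misses, and a complete ray test would also need a check that the hit lies in front of the ray origin.

```python
def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def plucker(p, q):
    """Plücker coordinates (U, V) of the directed line from p to q."""
    u = (q[0] - p[0], q[1] - p[1], q[2] - p[2])
    return u, cross(p, q)


def side(l1, l2):
    """Permuted inner product; its sign tells how l1 passes around l2."""
    return dot(l1[0], l2[1]) + dot(l2[0], l1[1])


def line_hits_triangle(origin, direction, v0, v1, v2):
    """True if the line through `origin` along `direction` crosses the
    triangle interior (all three edge side-products share one sign)."""
    tip = (origin[0] + direction[0],
           origin[1] + direction[1],
           origin[2] + direction[2])
    ray = plucker(origin, tip)
    s = [side(ray, plucker(a, b)) for a, b in ((v0, v1), (v1, v2), (v2, v0))]
    return all(x > 0 for x in s) or all(x < 0 for x in s)


tri = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
print(line_hits_triangle((0.25, 0.25, 1.0), (0.0, 0.0, -1.0), *tri))
print(line_hits_triangle((2.0, 2.0, 1.0), (0.0, 0.0, -1.0), *tri))
```

The test needs only dot and cross products per edge, with no division and no barycentric solve, which is the property the abstract credits for the reduced computational cost.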

Accelerated Numerical Computations of Antennas Using OpenMP, MPI, CUDA (OpenMP, MPI, CUDA를 이용한 안테나 수치 계산 가속화)

  • Cho, Yong-Heui
    • Proceedings of the Korea Contents Association Conference / 2014.11a / pp.41-42 / 2014
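  • A major concern in the analysis of large antennas is speeding up antenna numerical computation. We present the drawbacks that arise when antenna numerical computation is parallelized with the currently popular parallel processing approaches OpenMP, MPI, and CUDA, and we also describe the advantages of each approach.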

A Study of CUDA-Based Parallel Haar Transform (CUDA 기반 병렬 Haar Transform 고찰)

  • Lee, Sang-Il;Park, Neung-Soo
    • Proceedings of the Korea Information Processing Society Conference / 2009.11a / pp.249-250 / 2009
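  • We implement in CUDA the one-dimensional Haar wavelet transform, the simplest form of the wavelet transform widely used in fields such as acoustic signal processing, image processing, and human vision. By optimizing the implementation to exploit hardware characteristics, we examine the relationship between data-parallel performance and the CUDA memory architecture.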
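For readers unfamiliar with the transform, the sketch below shows a one-dimensional Haar wavelet transform in plain Python. It is an illustration only, not the paper's CUDA code: it assumes the unnormalized average/difference form and a power-of-two input length, and the function names are hypothetical. Every output pair in `haar_step` depends on just two inputs, which is the data parallelism a GPU implementation can exploit.

```python
def haar_step(x):
    """One decomposition level: pairwise averages and pairwise details.

    Each index i reads only x[2*i] and x[2*i + 1], so all pairs could be
    processed independently (for example, one GPU thread per pair).
    """
    half = len(x) // 2
    avg = [(x[2 * i] + x[2 * i + 1]) / 2 for i in range(half)]
    det = [(x[2 * i] - x[2 * i + 1]) / 2 for i in range(half)]
    return avg, det


def haar_1d(signal):
    """Full 1D Haar transform of a power-of-two-length signal."""
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = [float(v) for v in signal]
    length = n
    while length > 1:
        avg, det = haar_step(out[:length])
        out[:length] = avg + det   # averages first, then details
        length //= 2               # next level works on the averages only
    return out


print(haar_1d([4, 6, 10, 12]))
```

The number of independent pairs halves at every level, so the later levels offer little parallel work; this is one reason memory layout and level scheduling matter in a GPU version.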