• Title/Summary/Keyword: CUDA Implementation


Parallel Implementation of SPECK, SIMON and SIMECK by Using NVIDIA CUDA PTX (NVIDIA CUDA PTX를 활용한 SPECK, SIMON, SIMECK 병렬 구현)

  • Jang, Kyung-bae;Kim, Hyun-jun;Lim, Se-jin;Seo, Hwa-jeong
    • Journal of the Korea Institute of Information Security & Cryptology, v.31 no.3, pp.423-431, 2021
  • SPECK and SIMON are lightweight block ciphers developed by the NSA (National Security Agency), and SIMECK is a newer lightweight block cipher that combines the advantages of both. This paper implements large-volume encryption with SPECK, SIMON, and SIMECK on a GPU, which is well suited to parallel processing. The implementation uses NVIDIA's CUDA, and performance is maximized by writing in PTX, the CUDA assembly language, to eliminate unnecessary operations. Compared with a simple CPU implementation, the GPU implementation encrypts large volumes of data faster. Among the GPU implementations, the PTX version outperforms the C version.
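The reason these ciphers parallelize well is that every block is encrypted independently with the same short add-rotate-xor round. The following is a minimal Python sketch of that structure, not the paper's PTX code. It assumes SPECK with 32-bit words (rotation amounts 8 and 3) and takes the round keys as input instead of deriving them with the key schedule. The function names are hypothetical.

```python
MASK = 0xFFFFFFFF  # 32-bit words (assumption: a SPECK variant with a 64-bit block)

def ror(x, r):
    """Rotate a 32-bit word right by r bits."""
    return ((x >> r) | (x << (32 - r))) & MASK

def rol(x, r):
    """Rotate a 32-bit word left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK

def speck_encrypt_block(x, y, round_keys):
    """Apply one SPECK round per round key to the word pair (x, y)."""
    for k in round_keys:
        x = ((ror(x, 8) + y) & MASK) ^ k
        y = rol(y, 3) ^ x
    return x, y

def speck_decrypt_block(x, y, round_keys):
    """Undo speck_encrypt_block by inverting the rounds in reverse order."""
    for k in reversed(round_keys):
        y = ror(y ^ x, 3)
        x = rol(((x ^ k) - y) & MASK, 8)
    return x, y

def encrypt_many(blocks, round_keys):
    """Encrypt independent blocks; on a GPU each block would map to one thread."""
    return [speck_encrypt_block(x, y, round_keys) for x, y in blocks]
```

Because no block depends on another, the list comprehension in `encrypt_many` corresponds to the work a GPU would spread across threads.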

Efficient CUDA Implementation of Multiple Planes Fitting Using RANSAC (RANSAC을 이용한 다중 평면 피팅의 효율적인 CUDA 구현)

  • Cho, Tai-Hoon
    • Journal of the Korea Institute of Information and Communication Engineering, v.23 no.4, pp.388-393, 2019
  • RANSAC (RANdom SAmple Consensus) based algorithms are widely used to fit lines, circles, ellipses, and similar models to data containing outliers. CUDA is currently the most widely used GPU computing platform, offering massively parallel processing. This paper proposes an efficient CUDA implementation of multiple-plane fitting using RANSAC on 3D point data, in which each set of 3D points is used to fit one plane. The performance of the proposed algorithm is compared with a CPU implementation on both artificially generated data and real 3D height data of a PCB. The speed-up over the CPU tends to be higher for data with a lower inlier ratio, more planes to fit, and more points per plane. The method can be readily applied to a wide variety of other fitting applications.
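As an illustration of the fitting step, here is a minimal Python sketch of RANSAC for a single plane. It is not the paper's CUDA implementation. The iteration count, distance threshold, seed, and function names are all assumptions chosen for the example. In the multiple-plane setting described above, each point set would be fitted independently, which is what makes the problem parallel.

```python
import random

def plane_from_points(p, q, r):
    """Return the unit-normal plane (a, b, c, d) with a*x + b*y + c*z + d = 0, or None."""
    ux, uy, uz = q[0] - p[0], q[1] - p[1], q[2] - p[2]
    vx, vy, vz = r[0] - p[0], r[1] - p[1], r[2] - p[2]
    # The normal is the cross product of two edge vectors.
    nx, ny, nz = uy * vz - uz * vy, uz * vx - ux * vz, ux * vy - uy * vx
    norm = (nx * nx + ny * ny + nz * nz) ** 0.5
    if norm < 1e-12:
        return None  # degenerate sample: the three points are collinear
    nx, ny, nz = nx / norm, ny / norm, nz / norm
    return nx, ny, nz, -(nx * p[0] + ny * p[1] + nz * p[2])

def ransac_plane(points, iterations=200, threshold=0.01, seed=0):
    """Keep the sampled plane that has the most points within `threshold` of it."""
    rng = random.Random(seed)
    best_plane, best_count = None, -1
    for _ in range(iterations):
        plane = plane_from_points(*rng.sample(points, 3))
        if plane is None:
            continue
        a, b, c, d = plane
        count = sum(1 for x, y, z in points
                    if abs(a * x + b * y + c * z + d) <= threshold)
        if count > best_count:
            best_plane, best_count = plane, count
    return best_plane, best_count
```

A lower inlier ratio means more iterations are needed before an all-inlier sample is drawn, which is consistent with the abstract's observation that the GPU speed-up grows in that regime.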

Implementation of Particle Swarm Optimization Method Using CUDA (CUDA를 이용한 Particle Swarm Optimization 구현)

  • Kim, Jo-Hwan;Kim, Eun-Su;Kim, Jong-Wook
    • The Transactions of The Korean Institute of Electrical Engineers, v.58 no.5, pp.1019-1024, 2009
  • In this paper, particle swarm optimization (PSO) is implemented with CUDA (Compute Unified Device Architecture) and applied to function optimization on several benchmark functions. CUDA runs on the GPU (Graphics Processing Unit) rather than the CPU and solves complex computing problems using the GPU's parallel processing capacity. In addition, CUDA makes GPU software convenient to develop. Compared with PSO executed on a general CPU, the CUDA implementation reduces PSO running time by about 38% on average, which suggests that CUDA is a promising framework for real-time optimization and control.
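For readers unfamiliar with PSO, the update that gets parallelized can be sketched as follows. This is a minimal sequential Python sketch under assumed parameters (inertia `w`, acceleration coefficients `c1` and `c2`, search bounds, swarm size), none of which come from the paper. The sphere function stands in for a benchmark function.

```python
import random

def pso(objective, dim, n_particles=30, iterations=200, seed=0,
        w=0.7, c1=1.5, c2=1.5, lo=-5.0, hi=5.0):
    """Minimize `objective` with a basic global-best particle swarm."""
    rng = random.Random(seed)
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                 # best position seen by each particle
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # best position seen by the swarm
    for _ in range(iterations):
        # Each particle's update is independent, so this loop is the
        # natural candidate for one-thread-per-particle execution.
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

def sphere(x):
    """Simple benchmark: sum of squares, minimized at the origin."""
    return sum(v * v for v in x)
```

Note that this sketch updates the global best inside the particle loop. A parallel version would more likely update it once per iteration after a reduction over all particles.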

CUDA-based Fast DRR Generation for Analysis of Medical Images (의료영상 분석을 위한 CUDA 기반의 고속 DRR 생성 기법)

  • Yang, Sang-Wook;Choi, Young;Koo, Seung-Bum
    • Korean Journal of Computational Design and Engineering, v.16 no.4, pp.285-291, 2011
  • Pose estimation from medical images calculates the locations and orientations of objects obtained from Computed Tomography (CT) volume data, using X-ray images taken from two directions. In this process, digitally reconstructed radiograph (DRR) images of spatially transformed objects are generated and compared with the X-ray images repeatedly until reasonable transformation matrices of the objects are found. DRR generation and image comparison take the majority of the total pose estimation time. This paper introduces a fast DRR generation technique based on GPU parallel computing. A volume ray-casting algorithm is explained with brief vector operations, and a technique for parallelizing the algorithm with Compute Unified Device Architecture (CUDA) is discussed. The paper also presents implementation results and time measurements compared with those of a pure-CPU implementation and an open source toolkit.
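The core of volume ray casting is a per-pixel loop that steps along a ray and accumulates the sampled voxel values. The sketch below shows that loop in Python under simplifying assumptions that are not from the paper: a parallel projection along the z axis, nearest-neighbour sampling, and no object transformation. The names `cast_ray` and `render_drr` are hypothetical.

```python
def cast_ray(volume, origin, direction, step, n_steps):
    """Sum nearest-neighbour samples of volume[z][y][x] along one ray."""
    nz, ny, nx = len(volume), len(volume[0]), len(volume[0][0])
    total = 0.0
    for i in range(n_steps):
        # Position on the ray: origin + direction * (step * i).
        x = origin[0] + direction[0] * step * i
        y = origin[1] + direction[1] * step * i
        z = origin[2] + direction[2] * step * i
        ix, iy, iz = int(round(x)), int(round(y)), int(round(z))
        if 0 <= ix < nx and 0 <= iy < ny and 0 <= iz < nz:
            total += volume[iz][iy][ix] * step  # samples outside the volume add nothing
    return total

def render_drr(volume, width, height, depth_steps):
    """Cast one ray per pixel along +z; every pixel is independent of the others."""
    return [[cast_ray(volume, (u, v, 0.0), (0.0, 0.0, 1.0), 1.0, depth_steps)
             for u in range(width)]
            for v in range(height)]
```

Since each pixel's ray reads the volume without writing to it, a GPU version can assign one thread per output pixel.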

Optimization of Warp-wide CUDA Implementation for Parallel Shifted Sort Algorithm (병렬 Shifted Sort 알고리즘의 Warp 단위 CUDA 구현 최적화)

  • Park, Taejung
    • Journal of Digital Contents Society, v.18 no.4, pp.739-745, 2017
  • This paper presents and discusses an implementation of the GPU shifted sort method for finding approximate k nearest neighbors that executes within a "warp", the minimum execution unit in the GPU parallel architecture. It also compares the results with two other common nearest neighbor search methods, a GPU-based kd-tree and the ANN (Approximate Nearest Neighbor) library. The proposed implementation focuses on small k, i.e., 2, 4, 8, and 16, which can be handled efficiently within a warp, since applications very commonly use small values of k. The paper also discusses ways to optimize the implementation by improving memory management in a loop for the CUB open library and by adopting CUDA commands that are supported by GPU hardware. In the tests, the proposed implementation shows more than a 16-fold speed-up over the other GPU-based methods, implying that the improvement would grow for larger input data.

Analysis of GPU-based Parallel Shifted Sort Algorithm by comparing with General GPU-based Tree Traversal (일반적인 GPU 트리 탐색과의 비교실험을 통한 GPU 기반 병렬 Shifted Sort 알고리즘 분석)

  • Kim, Heesu;Park, Taejung
    • Journal of Digital Contents Society, v.18 no.6, pp.1151-1156, 2017
  • Traversing tree data structures on a GPU commonly yields lower performance than expected. In this paper, we analyze the reason for this lower-than-expected performance and show that warp divergence is caused by the branch instructions ("if ... else") that commonly appear in tree traversal CUDA code. We also compare the parallel shifted sort algorithm, which can reduce the number of warp divergences, with a kd-tree CUDA implementation to show that the shifted sort algorithm can run faster thanks to fewer warp divergences. In the analysis, the shifted sort algorithm ran about 16-fold faster than the kd-tree CUDA implementation for $2^{23}$ query points and $2^{23}$ data points in $R^3$ space. The performance gap tends to increase in proportion to the number of query points and data points.

Optimal Implementation of Lightweight Block Cipher PIPO on CUDA GPGPU (CUDA GPGPU 상에서 경량 블록 암호 PIPO의 최적 구현)

  • Kim, Hyun-Jun;Eum, Si-Woo;Seo, Hwa-Jeong
    • Journal of the Korea Institute of Information Security & Cryptology, v.32 no.6, pp.1035-1043, 2022
  • With the spread of the Internet of Things (IoT), cloud computing, and big data, the need for high-speed encryption in applications is emerging. GPU optimization can also be used to validate, in a reasonable time, theoretically obtained cryptanalysis results or reduced versions of them. In this paper, the lightweight block cipher PIPO, which has been implemented in various environments, is implemented on a GPU. The implementation is optimized with a brute-force attack on PIPO in mind. In particular, it applies the bit-slicing technique and makes as much use of GPU features as possible. As a result, the proposed implementation achieves a throughput of about 19.5 billion per second on an RTX 3060, about 122 times higher than that of the previous study.
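Bit slicing, mentioned in this abstract, stores bit b of many independent inputs in one machine word, so that a single bitwise instruction processes all inputs at once. The Python sketch below illustrates the idea on a hypothetical 3-bit nonlinear map. The map is a toy invented for this example and is not the PIPO S-box, and the helper names are assumptions.

```python
def to_bitslices(values, n_bits):
    """Transpose: bit j of slices[b] equals bit b of values[j]."""
    slices = [0] * n_bits
    for j, v in enumerate(values):
        for b in range(n_bits):
            slices[b] |= ((v >> b) & 1) << j
    return slices

def from_bitslices(slices, count):
    """Inverse transpose: rebuild `count` values from their bit slices."""
    return [sum(((s >> j) & 1) << b for b, s in enumerate(slices))
            for j in range(count)]

def toy_layer(v):
    """Hypothetical 3-bit nonlinear map applied to a single value."""
    b0, b1, b2 = v & 1, (v >> 1) & 1, (v >> 2) & 1
    b0 ^= b1 & b2
    b1 ^= b0 | b2
    return b0 | (b1 << 1) | (b2 << 2)

def toy_layer_bitsliced(slices):
    """Same map, but each bitwise operation handles every packed input at once."""
    s0, s1, s2 = slices
    s0 ^= s1 & s2
    s1 ^= s0 | s2
    return [s0, s1, s2]
```

The bitsliced version runs the same two logic steps regardless of how many inputs are packed into each word, which is why the technique suits exhaustive key search.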

Time Complexity Measurement on CUDA-based GPU Parallel Architecture of Morphology Operation

  • Izmantoko, Yonny S.;Choi, Heung-Kook
    • Journal of Korea Multimedia Society, v.16 no.4, pp.444-452, 2013
  • The operation time of a function or procedure is always a target for optimization. Parallelization is the general method for reducing it, and one of the most powerful ways to parallelize is to use a GPU. In image processing, morphology operations are among the most commonly used. This paper presents three types of morphology kernels: naïve, global, and shared. All kernels are written in CUDA and run in parallel on the GPU. Four morphology operations (erosion, dilation, opening, and closing) with a square structuring element are tested on MRI images of different sizes to measure the speedup of the GPU implementation over the CPU implementation. The results show that the speedup of dilation is similar for all kernels. For erosion, opening, and closing, however, the shared kernel is faster than the other kernels.
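The four operations named above can be sketched in a few lines of Python. This is a plain CPU reference, not any of the paper's three CUDA kernels. It assumes a square structuring element of half-width `k` and clips the window at the image border, which is one of several possible border policies.

```python
def morph(image, k, reduce_fn):
    """Apply reduce_fn over a (2k+1) x (2k+1) square window around each pixel."""
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):  # on a GPU, one thread would compute one output pixel
            window = [image[yy][xx]
                      for yy in range(max(0, y - k), min(h, y + k + 1))
                      for xx in range(max(0, x - k), min(w, x + k + 1))]
            out[y][x] = reduce_fn(window)
    return out

def erode(image, k=1):
    """Erosion: each pixel becomes the minimum of its window."""
    return morph(image, k, min)

def dilate(image, k=1):
    """Dilation: each pixel becomes the maximum of its window."""
    return morph(image, k, max)

def opening(image, k=1):
    """Opening: erosion followed by dilation."""
    return dilate(erode(image, k), k)

def closing(image, k=1):
    """Closing: dilation followed by erosion."""
    return erode(dilate(image, k), k)
```

Neighbouring output pixels read overlapping windows, which is the data reuse that a shared-memory kernel can exploit.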

Correct Implementation of Sub-warp Parallel Prefix Operations based on GPU Hardware Architecture (GPU 하드웨어 아키텍처 기반 sub-warp 단위 병렬 프리픽스(prefix) 연산의 정확한 구현)

  • Park, Taejung
    • Journal of Digital Contents Society, v.18 no.3, pp.613-619, 2017
  • This paper presents CUDA (Compute Unified Device Architecture) code that produces correct GPU parallel segmented prefix operation results for large data arrays when the segment length is less than 32. Mark Harris and Michael Garland previously published CUDA code for this task. This paper shows that their code does not generate correct results when the local segment length is less than 32, discusses the cause of the problem, and presents CUDA code that generates correct results. The segmented parallel prefix operation presented here can serve as a building block for various large-scale parallel processing algorithms, including k-nearest neighbor search.
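To make the operation concrete, the following Python sketch computes an inclusive segmented prefix sum with fixed-length segments. It is a sequential emulation of a step-doubling scan and is neither the Harris and Garland code nor the corrected code from the paper. The function name and the snapshot-based emulation of lockstep lanes are assumptions.

```python
def segmented_inclusive_scan(data, seg_len):
    """Inclusive prefix sum that restarts every `seg_len` elements."""
    vals = list(data)
    offset = 1
    while offset < seg_len:
        prev = vals[:]  # snapshot: every lane reads before any lane writes
        for i in range(len(vals)):
            # A lane only adds a neighbour that lies inside its own segment.
            if i % seg_len >= offset:
                vals[i] = prev[i] + prev[i - offset]
        offset *= 2
    return vals
```

For example, scanning `[1, 2, 3, 4, 5, 6, 7]` with `seg_len=3` treats `[1, 2, 3]`, `[4, 5, 6]`, and `[7]` as separate segments. The guard on the lane's position within its segment is what keeps sums from leaking across segment boundaries.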