• Title/Summary/Keyword: Cuda

High Resolution Depth-map Estimation in Real-time using Efficient Multi-threading (효율적인 멀티 쓰레딩을 이용한 고해상도 깊이지도의 실시간 획득)

  • Cho, Chil-Suk;Jun, Ji-In;Choo, Hyon-Gon;Park, Jong-Il
    • Journal of Broadcast Engineering / v.17 no.6 / pp.945-953 / 2012
  • A depth map can be obtained by projecting and capturing stripe patterns with a projector-camera system and analyzing the geometric relationship between the projected and captured patterns. This is usually called the structured light technique. In this paper, we propose a new multi-threading scheme for accelerating a conventional structured light technique. On CPUs and GPUs, multi-threading can be implemented using OpenMP and CUDA, respectively. The problem is that their relative performance changes with the computational conditions of the individual processes that make up a structured light technique. Specifically, OpenMP (using multiple CPUs) outperformed CUDA (using multiple GPUs) in processes such as pattern decoding and depth estimation, whereas CUDA outperformed OpenMP in processes such as rectification and pattern segmentation. We therefore carefully analyze the computational conditions under which each outperforms the other and use the better one for each condition. As a result, the proposed method can estimate a depth map at over 25 fps on $1280{\times}800$ images.
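
The per-stage selection described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the stage names come from the abstract, but the timing values, the `choose_backends` and `frame_time_ms` helpers, and the idea of selecting from a one-off profiling table are assumptions made for the example.

```python
# Hypothetical per-stage timings in milliseconds (illustrative values only,
# not measurements from the paper).
PROFILED_MS = {
    "rectification":        {"openmp": 9.0, "cuda": 3.0},
    "pattern_segmentation": {"openmp": 8.0, "cuda": 4.0},
    "pattern_decoding":     {"openmp": 5.0, "cuda": 7.0},
    "depth_estimation":     {"openmp": 6.0, "cuda": 10.0},
}


def choose_backends(profiled_ms):
    """Pick, for every stage, the backend with the lowest profiled time."""
    return {stage: min(times, key=times.get) for stage, times in profiled_ms.items()}


def frame_time_ms(profiled_ms, plan):
    """Sum the per-stage times of a plan, ignoring data-transfer overheads."""
    return sum(profiled_ms[stage][backend] for stage, backend in plan.items())


plan = choose_backends(PROFILED_MS)
mixed_ms = frame_time_ms(PROFILED_MS, plan)
print(plan, mixed_ms)
```

With these made-up numbers the mixed plan needs 18 ms per frame, whereas running every stage on a single backend would need 28 ms (OpenMP) or 24 ms (CUDA). A real system would also have to account for host-device transfer costs between stages, which this sketch ignores.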

Fast View Synthesis Using GPGPU (GPGPU를 이용한 고속 영상 합성 기법)

  • Shin, Hong-Chang;Park, Han-Hoon;Park, Jong-Il
    • Journal of Broadcast Engineering / v.13 no.6 / pp.859-874 / 2008
  • In this paper, we develop a fast view synthesis method that generates multiple intermediate views in real time for a 3D display system when the camera geometry and depth maps of the reference views are given in advance. The proposed method achieves faster view synthesis than previous GPU approaches by processing in parallel all of the computations required for view synthesis. Specifically, we use CUDA™ (by NVIDIA) to control the GPU device. To increase the processing speed, we adapted all of the view synthesis processes to the single instruction multiple data (SIMD) structure that is a main feature of CUDA, maximized the use of the high-speed memories on the GPU device, and optimized the implementation. As a result, we could synthesize 9 intermediate view images of 720 by 480 pixels within 0.128 seconds.

Optimization of Color Format Conversion of WebCam Images Using the CUDA (CUDA를 이용한 웹캠 영상의 색상 형식 변환 최적화)

  • Kim, Jin-Woo;Jung, Yun-Hye;Park, Jin-Hong;Park, Yong-Jin;Han, Tack-Don
    • Journal of Korea Game Society / v.11 no.1 / pp.147-157 / 2011
  • Webcams do not perform memory alignment, in order to reduce the transmission time of image data. Memory-unaligned image data is unsuitable for processing on a GPU, so we convert it to a color format suited to high-speed image processing. In this paper, we propose a technique that accelerates webcam color format conversion using NVIDIA CUDA. We propose optimizations of memory accesses and thread composition, and evaluate memory and computing performance to verify our hypothesis about the performance of the proposed architecture and the degree of optimization on a low-performance GPU. With the optimization technique, we show a maximum performance improvement of over 68 percent.

A Parallel Bulk Loading Method for $B^+$-Tree Using CUDA (CUDA를 활용한 병렬 $B^+$-트리 벌크로드 기법)

  • Sung, Joo-Ho;Lee, Yoon-Woo;Han, A;Choi, Won-Ik;Kwon, Dong-Seop
    • Journal of KIISE:Computing Practices and Letters / v.16 no.6 / pp.707-711 / 2010
  • Most relational database systems provide $B^+$-trees as their main index structures, and use bulk-loading techniques to create new $B^+$-trees from scratch on existing data. Although bulk loading is more efficient than inserting keys one by one, it is still time-consuming because all the keys of a large data set have to be sorted. To improve the performance of bulk loading, this paper proposes an efficient parallel bulk loading method for $B^+$-trees based on CUDA, a parallel computing architecture developed by NVIDIA to harness the computing power of graphics processing units for general-purpose computing. Experimental results show that the proposed method improves performance by more than 70 percent compared to existing bulk loading methods.
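
The sort-then-build structure of bulk loading mentioned in this abstract can be sketched as below. This is a plain sequential illustration, not the paper's parallel CUDA method: the `bulk_load` function, its `fanout` parameter, and the list-of-levels representation are assumptions chosen for brevity.

```python
def bulk_load(keys, fanout=4):
    """Build B+-tree levels bottom-up from unsorted keys.

    Returns a list of levels: levels[0] is the leaf level and levels[-1]
    holds the single root node. Each node is a list of keys.
    """
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    sorted_keys = sorted(keys)  # the sorting step the abstract calls time-consuming
    # Pack the sorted keys into leaves of at most `fanout` keys each.
    leaves = [sorted_keys[i:i + fanout] for i in range(0, len(sorted_keys), fanout)]
    levels = [leaves]
    # `mins` holds the smallest key stored under each node of the current level.
    mins = [leaf[0] for leaf in leaves]
    while len(levels[-1]) > 1:
        parents, parent_mins = [], []
        for i in range(0, len(mins), fanout):
            group = mins[i:i + fanout]
            parents.append(group[1:])     # separator keys between the children
            parent_mins.append(group[0])  # smallest key under this parent
        levels.append(parents)
        mins = parent_mins
    return levels


levels = bulk_load([7, 3, 9, 1, 5, 8, 2, 6, 4, 10], fanout=4)
print(levels)
```

For the ten keys above, the leaf level is `[[1, 2, 3, 4], [5, 6, 7, 8], [9, 10]]` and the root holds the separators `[5, 9]`. The sketch does not rebalance underfull trailing nodes, which a production bulk loader would normally handle.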

Acceleration Method for Integral Imaging Generation of Volume Data based on CUDA (CUDA를 기반한 볼륨데이터의 집적영상 생성을 위한 고속화 기법)

  • Park, Chan;Jeong, Ji-Seong;Park, Jae-Hyeung;Kwon, Ki-Chul;Kim, Nam;Yoo, Kwan-Hee
    • The Journal of the Korea Contents Association / v.11 no.3 / pp.9-17 / 2011
  • With the recent advent of stereoscopic 3D TV, 3D stereoscopic content is expected to become more widespread. Research on autostereoscopic 3D displays has been carried out to relieve the discomfort of stereoscopic 3D displays. This research requires generating the elemental image from a lens array. As the number of lenses in the array increases, generating the elemental image takes a long time, and it takes even longer for large volume data. To address this problem, we propose a method that generates the elemental image using OpenCL based on CUDA. We run the proposed method in a PC environment equipped with one of the Tesla C1060, GeForce 9800GT, and Quadro FX 3800 graphics cards. Experimental results show that the proposed method achieves almost 20 times better performance than a recent research result[11].

Development and Speed Comparison of Convolutional Neural Network Using CUDA (CUDA를 이용한 Convolutional Neural Network의 구현 및 속도 비교)

  • Ki, Cheol-min;Cho, Tai-Hoon
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2017.05a / pp.335-338 / 2017
  • Artificial intelligence and deep learning are currently widely discussed topics, and these technologies are being applied in various fields. One effective method among the various artificial intelligence algorithms is the convolutional neural network, which adds convolution layers that extract features by convolution operations to a general neural network. When a convolutional neural network is used with a small amount of data, or when the layer structure is not complicated, speed is not a concern. However, training takes a long time when the training data are large and the layer structure is complicated, so GPU-based parallel processing is widely used. In this paper, we developed a convolutional neural network using CUDA, and its learning speed is faster and more efficient than that of the CPU-based method.

Development of GPU-accelerated kinematic wave model using CUDA fortran (CUDA fortran을 이용한 GPU 가속 운동파모형 개발)

  • Kim, Boram;Park, Seonryang;Kim, Dae-Hong
    • Journal of Korea Water Resources Association / v.52 no.11 / pp.887-894 / 2019
  • We propose a GPU (Graphics Processing Unit) accelerated kinematic wave model for rainfall-runoff simulation and test its accuracy and speed-up performance. The governing equations are the kinematic wave equation for surface flow and the Green-Ampt model for infiltration. The kinematic wave equations were discretized using a finite volume method, and CUDA Fortran was used to implement the rainfall-runoff model. Several numerical tests were conducted. The results computed by the GPU-accelerated kinematic wave model were compared with several measured and other numerical results, and reasonable agreement was observed. The speed-up of the GPU-accelerated model increased with the number of grids, reaching a maximum of approximately 450 times compared to a CPU (Central Processing Unit) version, at least for the tested computing resources.

High-Speed Implementations of Block Ciphers on Graphics Processing Units Using CUDA Library (GPU용 연산 라이브러리 CUDA를 이용한 블록암호 고속 구현)

  • Yeom, Yong-Jin;Cho, Yong-Kuk
    • Journal of the Korea Institute of Information Security & Cryptology / v.18 no.3 / pp.23-32 / 2008
  • The computing power of graphics processing units (GPUs) has already surpassed that of CPUs, and the gap between them is widening. Research on GPGPU, which applies GPUs to general-purpose computing, has therefore become popular and has shown great success, especially in parallel data processing. Since Cook et al. began implementing cryptographic algorithms on GPUs in 2005, improved results using graphics libraries such as OpenGL and DirectX have been published. In this paper, we present techniques for and results of implementing block ciphers using the CUDA library announced by NVIDIA in 2007. We also discuss a general method for converting CPU source code of block ciphers to GPU code. On an NVIDIA 8800GTX GPU, the resulting speeds of the block ciphers AES, ARIA, and DES are 4.5Gbps, 7.0Gbps, and 2.8Gbps, respectively, which are faster than those on the CPU.

PDF Version 1.4-1.6 Password Cracking in CUDA GPU Environment (PDF 버전 1.4-1.6의 CUDA GPU 환경에서 암호 해독 최적 구현)

  • Hyun Jun, Kim;Si Woo, Eum;Hwa Jeong, Seo
    • KIPS Transactions on Computer and Communication Systems / v.12 no.2 / pp.69-76 / 2023
  • Hundreds of thousands of passwords are lost or forgotten every year, making necessary information unavailable to legitimate owners or authorized law enforcement personnel. Recovering such a password requires a password cracking tool. Using GPUs instead of CPUs for password cracking makes it possible to process quickly the large amount of computation required during recovery. This paper presents a CUDA-optimized GPU implementation focused on decrypting PDF versions 1.4-1.6, currently the most popular. Techniques such as eliminating unnecessary operations of the MD5 algorithm, implementing 32-bit word integration of the RC4 algorithm, and using shared memory were applied. In addition, autotune techniques were used to search for the number of blocks and threads that affect performance. As a result, we achieved throughputs of 31,460 kp/s (kilo passwords per second) and 66,351 kp/s at block size 65,536 and thread size 96 on the RTX 3060 and RTX 3090, respectively, improving throughput by 22.5% and 15.2%, respectively, compared to hashcat, the cracking tool that achieves the highest throughput.
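
The figures quoted in this abstract can be related with a little arithmetic. The sketch below rests on two assumptions that the abstract does not state explicitly: that "improved by X%" means the reported throughput equals the hashcat throughput times (1 + X/100), and that "block size 65,536, thread size 96" describes a launch of 65,536 blocks with 96 threads each. The `implied_baseline` helper is hypothetical.

```python
def implied_baseline(throughput_kps, improvement_percent):
    """Baseline throughput implied by 'improved by improvement_percent'."""
    return throughput_kps / (1.0 + improvement_percent / 100.0)


# Assumed reading of the launch configuration: blocks x threads per block.
BLOCKS, THREADS_PER_BLOCK = 65_536, 96
threads_per_launch = BLOCKS * THREADS_PER_BLOCK

# (reported throughput in kp/s, reported improvement over hashcat in percent)
reported = {"RTX 3060": (31_460, 22.5), "RTX 3090": (66_351, 15.2)}
baselines = {gpu: implied_baseline(kps, pct) for gpu, (kps, pct) in reported.items()}

print(threads_per_launch)
for gpu, kps in baselines.items():
    print(f"{gpu}: implied hashcat throughput of about {kps:,.0f} kp/s")
```

Under these assumptions, one launch covers 6,291,456 threads, and the implied hashcat throughputs are roughly 25,682 kp/s on the RTX 3060 and 57,596 kp/s on the RTX 3090. These baselines are derived values, not figures reported in the abstract.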

An MPI-CUDA Implementation for Parallel Scalability on Multi-GPU Clusters (멀티-GPU 기반 MPI-CUDA 병렬 성능 확장성)

  • Yi, Hong-Suk;Lee, Seung-Min
    • Proceedings of the Korean Information Science Society Conference / 2012.06a / pp.13-15 / 2012
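  • With their very high performance and low development cost, modern GPUs have emerged as an essential resource for large-scale computational science. In this paper, we developed a large-scale Monte Carlo algorithm that applies GPU computing techniques on a multi-GPU cluster system. By applying MPI and CUDA together, we obtained parallel scalability up to 8 GPUs. An analysis of parallel performance scalability showed that, on multi-GPU clusters, data communication between GPUs is a very important factor in determining the overall performance improvement of the program.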