• Title/Summary/Keyword: NVIDIA


CUDA-based Parallel Bi-Conjugate Gradient Matrix Solver for BioFET Simulation (BioFET 시뮬레이션을 위한 CUDA 기반 병렬 Bi-CG 행렬 해법)

  • Park, Tae-Jung;Woo, Jun-Myung;Kim, Chang-Hun
    • Journal of the Institute of Electronics Engineers of Korea CI
    • /
    • v.48 no.1
    • /
    • pp.90-100
    • /
    • 2011
  • We present a parallel bi-conjugate gradient (Bi-CG) matrix solver for large-scale BioFET simulations based on recent graphics processing units (GPUs), which enable large-scale parallel processing at very low cost. The proposed method focuses on solving the Poisson equation in parallel, a task that demands massive computational resources not only in semiconductor simulation but also in various other fields, including computational fluid dynamics and heat-transfer simulation. As a result, our solver is around 30 times faster at solving the Poisson equation in a 3D FDM (Finite Difference Method) scheme than traditional methods on single-core CPU systems. The proposed method is implemented and tested in NVIDIA's CUDA (Compute Unified Device Architecture) environment, which enables general-purpose parallel processing on GPUs. Unlike other similar GPU-based approaches, which usually apply 32-bit single-precision floating-point arithmetic, we use 64-bit double-precision operations for better convergence. Applications on the CUDA platform are rather easy to implement but very hard to optimize for performance; in this regard, we also discuss the optimization strategy of the proposed method.
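
The abstract outlines the solver's ingredients without listing code. As a minimal sketch under stated assumptions (this is not the authors' implementation), a Bi-CG iteration over a 3D FDM Poisson operator repeatedly launches two double-precision kernels like the following; the grid dimensions nx, ny, nz, the flattened indexing, and folding the boundary values into the right-hand side are all assumptions.

```cuda
// y = A*x for the 7-point finite-difference Poisson stencil on an
// nx x ny x nz grid (boundary values assumed folded into the RHS).
__global__ void stencil7_matvec(const double* x, double* y,
                                int nx, int ny, int nz) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    if (i >= nx || j >= ny || k >= nz) return;

    int idx = (k * ny + j) * nx + i;
    double v = 6.0 * x[idx];                 // diagonal of the Laplacian
    if (i > 0)      v -= x[idx - 1];
    if (i < nx - 1) v -= x[idx + 1];
    if (j > 0)      v -= x[idx - nx];
    if (j < ny - 1) v -= x[idx + nx];
    if (k > 0)      v -= x[idx - nx * ny];
    if (k < nz - 1) v -= x[idx + nx * ny];
    y[idx] = v;
}

// y += alpha * x, the double-precision update applied to the solution,
// the residual, and the shadow residual in each Bi-CG iteration.
// Launch as e.g. daxpy<<<(n + 255) / 256, 256>>>(n, alpha, x, y);
__global__ void daxpy(int n, double alpha, const double* x, double* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += alpha * x[i];
}
```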

Trends of Mobile GPU (모바일 GPU 동향)

  • Han, J.H.;Byun, J.G.;Eum, N.W.
    • Electronics and Telecommunications Trends
    • /
    • v.28 no.2
    • /
    • pp.50-57
    • /
    • 2013
  • The AP (Application Processor), a key component of smartphones and tablet PCs, always integrates a GPU (Graphics Processing Unit). Because of constraints on chip area and available power, it is subject to design constraints different from those of the high-performance GPUs mounted on desktop graphics cards. This article examines mobile GPU technology, which operates under these different design conditions, and reviews the features and performance of the mobile GPUs from the representative commercial vendors Imagination, ARM, Qualcomm, and NVidia.


Trends in the Technology Development of Light Field (Light Field 기술개발 동향)

  • Chun, H.W.;Han, M.K.;Jang, J.H.
    • Electronics and Telecommunications Trends
    • /
    • v.33 no.2
    • /
    • pp.56-63
    • /
    • 2018
  • A light field (LF) is a field that expresses the intensity and direction of light reflected from a subject in 3D space. In Korea, dual LF cameras are built into smartphones, mainly those by Samsung and LG. Lytro, Apple, Nvidia, MIT Media Lab, Magic Leap, and NHK are also developing LF camera technologies. LF displays are being researched by ETRI, Samsung Electronics, KIST, KAIST, SNU, MIT Media Lab, USC, HP, Apple, Microsoft, SeeReal, Musion Eyeliner, Dimenco, and Holografika.

GPU Acceleration of SIFT Detection (SIFT 추출의 GPU 가속)

  • Seo, Kyoung-Taek;Kwon, Oh-Young
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2015.10a
    • /
    • pp.238-241
    • /
    • 2015
  • Feature-point extraction algorithms are used in many computer vision fields, such as object recognition, robotics, and video tracking. Among them, the SIFT algorithm is composed of computationally heavy steps, so processing high-resolution images takes a long time and calls for GPU acceleration. In this paper, we process the SIFT algorithm in parallel using CUDA on NVIDIA GPU hardware, and we confirm a more than fourfold reduction in execution time, with even higher efficiency on high-resolution images with many feature points.
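
An illustrative sketch, not the paper's code: SIFT's scale-space construction is dominated by separable Gaussian filtering, which maps naturally to one CUDA thread per output pixel and is the kind of stage that benefits most from the parallelization the abstract reports. The kernel name, the clamped border handling, and the weight layout are assumptions.

```cuda
// Horizontal pass of a separable Gaussian blur: one thread per output pixel.
// Launch with a 2D grid covering width x height; 'weights' holds the
// (2 * radius + 1) filter taps.
__global__ void gaussianBlurRow(const float* src, float* dst,
                                int width, int height,
                                const float* weights, int radius) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float sum = 0.0f;
    for (int r = -radius; r <= radius; ++r) {
        int xs = min(max(x + r, 0), width - 1);   // clamp at the image border
        sum += weights[r + radius] * src[y * width + xs];
    }
    dst[y * width + x] = sum;
}
```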

A Comparison among Methods using CUDA Programming (CUDA 프로그래밍 기법 비교 연구)

  • Ihm, Sun-Young;Park, Young-Ho
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2013.05a
    • /
    • pp.138-139
    • /
    • 2013
  • As interest in parallel programming that utilizes the GPU grows, research on it is being actively conducted. As GPU performance rises, NVIDIA provides the CUDA programming environment as a way to apply GPUs to general computation. In this paper, we introduce CUDA programming techniques and compare how the CPU and the GPU are used through a simple example.
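
A minimal sketch of the kind of CPU-versus-GPU comparison the abstract describes (the paper's own example is not reproduced here): the same element-wise addition written once as a serial loop and once as a CUDA kernel, with the explicit allocations and copies that separate host from device.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// CPU version: a single core walks the arrays one element at a time.
void addCpu(int n, const float* a, const float* b, float* c) {
    for (int i = 0; i < n; ++i) c[i] = a[i] + b[i];
}

// GPU version: one thread per element.
__global__ void addGpu(int n, const float* a, const float* b, float* c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float* ha = (float*)malloc(bytes);
    float* hb = (float*)malloc(bytes);
    float* hc = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    addCpu(n, ha, hb, hc);                            // serial baseline

    float *da, *db, *dc;                              // device buffers
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);
    addGpu<<<(n + 255) / 256, 256>>>(n, da, db, dc);  // parallel version
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);

    printf("c[0] = %f\n", hc[0]);                     // expect 3.0
    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}
```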

Real-time Geometric Correction System for Digital Image Projection onto Deformable Surface (변형 가능한 곡면에서의 디지털 영상 투영을 위한 실시간 기하 보정 시스템)

  • Lee, Young-Bo;Han, Sang-Hun;Kim, Jung-Hoon;Lee, Dong-Hoon;Yun, Tae-Soo
    • Proceedings of the HCI Society of Korea Conference
    • /
    • 2008.02a
    • /
    • pp.39-44
    • /
    • 2008
  • This paper proposes a projector-based real-time geometric correction system for projecting digital images onto a deformable surface. Markers used to trace the many corresponding points would spoil the projected image, because they leave visible marks on the surface when the projector projects a digital image onto it. In addition, it is difficult to build a real-time geometric correction system, since the geometric correction process for the projected images creates bottlenecks. In this paper, we use invisible infrared markers together with a GPU vertex shader written with NVIDIA's Cg Toolkit in order to eliminate these disadvantages and the bottlenecks in marker recognition, so that naturally corrected images can be projected in real time. As a result, the system overlays an interactive virtual texture onto real paper through this geometric transformation. It is therefore possible to develop variations of AR (Augmented Reality) based digital content systems.
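
The paper performs the warp in a Cg vertex shader; purely to illustrate the same per-vertex idea in this page's one example language, the CUDA sketch below moves each vertex of the projection mesh to its corrected position by bilinearly sampling a displacement grid derived from the tracked infrared markers. The data layout, the [0,1] normalization, and all names are hypothetical.

```cuda
// One thread per mesh vertex: sample a gw x gh displacement grid (built from
// the infrared marker tracking) and offset the vertex accordingly.
__global__ void correctMesh(const float2* rest,  // undeformed vertex positions
                            const float2* disp,  // displacement grid, gw x gh
                            float2* out, int nVerts, int gw, int gh) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= nVerts) return;

    // Bilinear sample of the displacement grid at the vertex position
    // (positions assumed normalized to [0,1] x [0,1]).
    float gx = rest[v].x * (gw - 1), gy = rest[v].y * (gh - 1);
    int x0 = (int)gx, y0 = (int)gy;
    int x1 = min(x0 + 1, gw - 1), y1 = min(y0 + 1, gh - 1);
    float fx = gx - x0, fy = gy - y0;

    float dx = (1 - fy) * ((1 - fx) * disp[y0 * gw + x0].x + fx * disp[y0 * gw + x1].x)
             +      fy  * ((1 - fx) * disp[y1 * gw + x0].x + fx * disp[y1 * gw + x1].x);
    float dy = (1 - fy) * ((1 - fx) * disp[y0 * gw + x0].y + fx * disp[y0 * gw + x1].y)
             +      fy  * ((1 - fx) * disp[y1 * gw + x0].y + fx * disp[y1 * gw + x1].y);

    out[v].x = rest[v].x + dx;   // corrected position handed to the renderer
    out[v].y = rest[v].y + dy;
}
```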


Performance Enhancement and Evaluation of a Deep Learning Framework on Embedded Systems using Unified Memory (통합메모리를 이용한 임베디드 환경에서의 딥러닝 프레임워크 성능 개선과 평가)

  • Lee, Minhak;Kang, Woochul
    • KIISE Transactions on Computing Practices
    • /
    • v.23 no.7
    • /
    • pp.417-423
    • /
    • 2017
  • Recently, many embedded devices with the computing capability required for deep learning have become available, and many new applications using these devices are emerging. However, these embedded devices have an architecture different from that of PCs and high-performance servers. In this paper, we propose a method that improves the performance of a deep-learning framework by taking into account the architecture of embedded devices that share memory between the CPU and the GPU. The proposed method is implemented in Caffe, an open-source deep-learning framework, and evaluated on an NVIDIA Jetson TK1 embedded device. In the experiments, we investigate the image-recognition performance of several state-of-the-art deep-learning networks, including AlexNet, VGGNet, and GoogLeNet. Our results show that the proposed method achieves significant performance gains; for instance, on AlexNet we could reduce image-recognition latency by about 33% and energy consumption by about 50%.
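
A minimal sketch of the underlying idea, not the authors' Caffe modification: on a board where the CPU and GPU share physical memory, a single unified-memory allocation is visible to both sides, removing the host-to-device copies that a discrete-GPU code path would perform.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float* data, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= s;
}

int main() {
    const int n = 1 << 20;
    float* buf;
    cudaMallocManaged(&buf, n * sizeof(float)); // one buffer, visible to CPU and GPU
    for (int i = 0; i < n; ++i) buf[i] = 1.0f;  // CPU writes directly, no staging copy

    scale<<<(n + 255) / 256, 256>>>(buf, n, 2.0f); // GPU works on the same buffer
    cudaDeviceSynchronize();                       // required before the CPU reads

    printf("buf[0] = %f\n", buf[0]);               // expect 2.0
    cudaFree(buf);
    return 0;
}
```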

Analysis of Programming Techniques for Creating Optimized CUDA Software (최적화된 CUDA 소프트웨어 제작을 위한 프로그래밍 기법 분석)

  • Kim, Sung-Soo;Kim, Dong-Heon;Woo, Sang-Kyu;Ihm, In-Sung
    • Journal of KIISE: Computing Practices and Letters
    • /
    • v.16 no.7
    • /
    • pp.775-787
    • /
    • 2010
  • Unlike general-purpose CPUs, GPUs have been specialized as many-core streaming processors and are frequently replacing CPUs in an increasing range of computations thanks to their outstanding parallel computing capacity. In response to this trend, NVIDIA has introduced a parallel computing architecture called CUDA (Compute Unified Device Architecture), offering a flexible GPU programming environment for GPGPU (General-Purpose GPU) computing. In general, when programmers use the CUDA API, they must clearly understand many aspects of the GPU's computing architecture to produce efficient parallel software. In this article, we explain several optimization techniques for CUDA programming that we have verified through extensive experimentation and trial and error, and review how those techniques affect code-execution performance. In particular, we use a specific problem as an example to analyze several elements that affect performance, such as effective access to the hierarchical memory system, processor occupancy, and latency hiding. In conclusion, we present several directions that may be used effectively in CUDA-based parallel programming.
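
As one concrete instance of the memory-hierarchy techniques the abstract mentions (not the paper's specific example), the classic tiled matrix transpose stages data through shared memory so that both the global-memory reads and the writes stay coalesced; the tile size and the padding column that avoids shared-memory bank conflicts are standard choices.

```cuda
#define TILE 32

// Transpose a width x height matrix. Launch with 32x32 thread blocks on a
// grid of ceil(width/32) x ceil(height/32).
__global__ void transposeTiled(const float* in, float* out,
                               int width, int height) {
    __shared__ float tile[TILE][TILE + 1];   // +1 column avoids bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];    // coalesced read

    __syncthreads();

    // Swap block coordinates so the write is coalesced as well.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}
```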

A Road Region Extraction Using OpenCV CUDA To Advance The Processing Speed (처리 속도 향상을 위해 OpenCV CUDA를 활용한 도로 영역 검출)

  • Lee, Tae-Hee;Hwang, Bo-Hyun;Yun, Jong-Ho;Choi, Myung-Ryul
    • Journal of Digital Convergence
    • /
    • v.12 no.6
    • /
    • pp.231-236
    • /
    • 2014
  • In this paper, we propose an improvement in processing speed achieved by adding device (graphics card) based parallel processing to a host (PC) based serial road-region extraction. OpenCV CUDA supports many parallelized functions by combining conventional OpenCV with CUDA, and when the two are combined, the configured OpenCV functions are optimized for the specifications of the user's device (graphics card). Using OpenCV CUDA therefore makes algorithm verification and the derivation of simulation results easier. Through experiments using OpenCV CUDA and an NVIDIA GeForce GTX 560 Ti graphics card, the proposed method is verified to be about 3.09 times faster than the conventional method.
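
A hedged sketch of the upload, compute, download pattern that OpenCV CUDA code follows, written against the current cv::cuda API (the 2014-era cv::gpu namespace is analogous); the road-extraction pipeline itself is not reproduced, and the file names and Canny thresholds are placeholders.

```cuda
#include <opencv2/opencv.hpp>
#include <opencv2/cudaimgproc.hpp>

int main() {
    cv::Mat frame = cv::imread("road.png");   // placeholder input image
    cv::cuda::GpuMat dFrame, dGray, dEdges;

    dFrame.upload(frame);                     // host -> device, once per frame
    cv::cuda::cvtColor(dFrame, dGray, cv::COLOR_BGR2GRAY);

    // CUDA-backed Canny; thresholds are placeholders, tuned per scene.
    cv::Ptr<cv::cuda::CannyEdgeDetector> canny =
        cv::cuda::createCannyEdgeDetector(50.0, 150.0);
    canny->detect(dGray, dEdges);             // edge map for road boundaries

    cv::Mat edges;
    dEdges.download(edges);                   // device -> host for later stages
    cv::imwrite("edges.png", edges);
    return 0;
}
```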

Parallel Design and Implementation of Shot Boundary Detection Algorithm (샷 경계 탐지 알고리즘의 병렬 설계와 구현)

  • Lee, Joon-Goo;Kim, SeungHyun;You, Byoung-Moon;Hwang, DooSung
    • Journal of the Institute of Electronics and Information Engineers
    • /
    • v.51 no.2
    • /
    • pp.76-84
    • /
    • 2014
  • As the number of high-definition videos increases, parallel processing approaches are necessary to process large-scale video data. When a video-processing method requires thousands of simple operations, GPU-based parallel processing is preferred to CPU-based parallel processing as a way of reducing the time and space complexity of the computation. This paper studies the parallel design and implementation of a shot-boundary detection algorithm. The proposed algorithm uses pixel-brightness comparisons and global histogram data among the blocks of frames, and the computation of these data exhibits high parallelism in the related operations. To maximize this parallelism, the pixel-brightness and histogram computations are designed in parallel and implemented on an NVIDIA GPU. The GPU-based shot-detection method is tested with 10 videos from the National Archives of Korea. In experiments, the detection rate is similar to that of the CPU-based algorithm, but the computation is about 10 times faster.
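
As an illustration of the histogram side of the computation (not the authors' kernels), the sketch below bins one pixel per thread into a per-block shared-memory histogram and then merges it into the global histogram with atomics; the choice of 256 brightness bins is an assumption.

```cuda
#define BINS 256

// Per-frame brightness histogram: each block accumulates locally in shared
// memory, then merges into the global histogram, keeping global atomics rare.
__global__ void frameHistogram(const unsigned char* pixels, int n,
                               unsigned int* hist) {
    __shared__ unsigned int local[BINS];
    for (int b = threadIdx.x; b < BINS; b += blockDim.x) local[b] = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&local[pixels[i]], 1u);   // bin this pixel's brightness
    __syncthreads();

    for (int b = threadIdx.x; b < BINS; b += blockDim.x)
        atomicAdd(&hist[b], local[b]);             // merge into global histogram
}
```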