• Title/Summary/Keyword: Compute unified device architecture

Search Result 60, Processing Time 0.022 seconds

Real-Time Object Segmentation in Image Sequences (연속 영상 기반 실시간 객체 분할)

  • Kang, Eui-Seon;Yoo, Seung-Hun
    • The KIPS Transactions:PartB
    • /
    • v.18B no.4
    • /
    • pp.173-180
    • /
    • 2011
  • This paper shows an approach for real-time object segmentation on GPU (Graphics Processing Unit) using CUDA (Compute Unified Device Architecture). Recently, many applications that is monitoring system, motion analysis, object tracking or etc require real-time processing. It is not suitable for object segmentation to procedure real-time in CPU. NVIDIA provide CUDA platform for Parallel Processing for General Computation to upgrade limit of Hardware Graphic. In this paper, we use adaptive Gaussian Mixture Background Modeling in the step of object extraction and CCL(Connected Component Labeling) for classification. The speed of GPU and CPU is compared and evaluated with implementation in Core2 Quad processor with 2.4GHz.The GPU version achieved a speedup of 3x-4x over the CPU version.

Integer-Pel Motion Estimation for HEVC on Compute Unified Device Architecture (CUDA)

  • Lee, Dongkyu;Sim, Donggyu;Oh, Seoung-Jun
    • IEIE Transactions on Smart Processing and Computing
    • /
    • v.3 no.6
    • /
    • pp.397-403
    • /
    • 2014
  • A new video compression standard called High Efficiency Video Coding (HEVC) has recently been released onto the market. HEVC provides higher coding performance compared to previous standards, but at the cost of a significant increase in encoding complexity, particularly in motion estimation (ME). At the same time, the computing capabilities of Graphics Processing Units (GPUs) have become more powerful. This paper proposes a parallel integer-pel ME (IME) algorithm for HEVC on GPU using the Compute Unified Device Architecture (CUDA). In the proposed IME, concurrent parallel reduction (CPR) is introduced. CPR performs several parallel reduction (PR) operations concurrently to solve two problems in conventional PR; low thread utilization and high thread synchronization latency. The proposed encoder reduces the portion of IME in the encoder to almost zero with a 2.3% increase in bitrate. In terms of IME, the proposed IME is up to 172.6 times faster than the IME in the HEVC reference model.

Parallel Implementation of the Recursive Least Square for Hyperspectral Image Compression on GPUs

  • Li, Changguo
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.11 no.7
    • /
    • pp.3543-3557
    • /
    • 2017
  • Compression is a very important technique for remotely sensed hyperspectral images. The lossless compression based on the recursive least square (RLS), which eliminates hyperspectral images' redundancy using both spatial and spectral correlations, is an extremely powerful tool for this purpose, but the relatively high computational complexity limits its application to time-critical scenarios. In order to improve the computational efficiency of the algorithm, we optimize its serial version and develop a new parallel implementation on graphics processing units (GPUs). Namely, an optimized recursive least square based on optimal number of prediction bands is introduced firstly. Then we use this approach as a case study to illustrate the advantages and potential challenges of applying GPU parallel optimization principles to the considered problem. The proposed parallel method properly exploits the low-level architecture of GPUs and has been carried out using the compute unified device architecture (CUDA). The GPU parallel implementation is compared with the serial implementation on CPU. Experimental results indicate remarkable acceleration factors and real-time performance, while retaining exactly the same bit rate with regard to the serial version of the compressor.

Correct Implementation of Sub-warp Parallel Prefix Operations based on GPU Hardware Architecture (GPU 하드웨어 아키텍처 기반 sub-warp 단위 병렬 프리픽스(prefix) 연산의 정확한 구현)

  • Park, Taejung
    • Journal of Digital Contents Society
    • /
    • v.18 no.3
    • /
    • pp.613-619
    • /
    • 2017
  • This paper presents a CUDA (Compute Unified Device Architecture) code to achieve correct GPU parallel segmented prefix operation results with less than 32 segment length for large data arrays. Mark Harris and Michael Garland had published CUDA code to address the tasks. This paper shows that their code does not generate correct results when the local segment length is less than 32, discusses the cause of the problem, and presents a CUDA code that generates correct results. The segmented parallel prefix operation presented in this paper can be applied as a building block to various large parallel processing algorithms including the k-nearest neighbor search problems.

Fast Generation of Digital Hologram Based on Multi-GPU (Multi-GPU 기반의 고속 디지털 홀로그램 생성)

  • Song, Joong-Seok;Park, Jung-Sik;Seo, Young-Ho;Park, Jong-Il
    • Journal of Broadcast Engineering
    • /
    • v.16 no.6
    • /
    • pp.1009-1017
    • /
    • 2011
  • Fast generation of digital hologram is of importance for real-time holography broadcasting. In this paper, we propose such a method that parallelizes the Computer-Generated Holography (CGH) algorithm for digital hologram generation and make it faster using Multi Graphic Processing Unit (Multi-GPU) with help of the Compute Unified Device Architecture (CUDA) and the Open Multi-Processing (OpenMP). In addition, we propose optimization methods such as fixation variable, vectorization, and loop unrolling for making the CGH algorithm much faster. Experimental results show that our method is about 9,700 times faster than a CPU-based one.

Pedestrians Action Interpretation based on CUDA for Traffic Signal Control (교통신호제어를 위한 CUDA기반 보행자 행동판단)

  • Lee, Hong-Chang;Rhee, Sang-Yong;Kim, Young-Baek
    • Journal of the Korean Institute of Intelligent Systems
    • /
    • v.20 no.5
    • /
    • pp.631-637
    • /
    • 2010
  • In this paper, We propose a method of motion interpretation of pedestrian for active traffic signal control. We detect pedestrian object in a movie of crosswalk area by using the code book method and acquire contour information. To do this stage fast, we use parallel processing based on CUDA (Compute Unified Device Architecture). And we remove shadow which causes shape distortion of objects. Shadow removed object is judged by using the hilbert scan distance whether to human or noise. If the objects are judged as a human, we analyze pedestrian objects' motion, face area feature, waiting time to decide that they have intetion to across a crosswalk for pdestrians. Traffic signal can be controlled after judgement.

Analysis of Programming Techniques for Creating Optimized CUDA Software (최적화된 CUDA 소프트웨어 제작을 위한 프로그래밍 기법 분석)

  • Kim, Sung-Soo;Kim, Dong-Heon;Woo, Sang-Kyu;Ihm, In-Sung
    • Journal of KIISE:Computing Practices and Letters
    • /
    • v.16 no.7
    • /
    • pp.775-787
    • /
    • 2010
  • Unlike general-purpose CPUs, the GPUs have been specialized as many-core streaming processors, and are frequently replacing the CPUs in an increasing range of computations thanks to their outstanding parallel computing capacity. In order to respond to such trend, NVIDIA has recently issued a new parallel computing architecture called CUDA(Compute Unified Device Architecture), offering a flexible GPU programming environment for GPGPU(General Purpose GPU) computing. In general, when programmers use the CUDA API, they should clearly understand many aspects of GPU's computing architecture to produce efficient parallel software. In this article, we explain several optimization techniques for CUDA programming that we have verified through a lot of experiment and trial and error, and review how those techniques affect the performance of code execution. In particular, we use a specific problem as an example to analyze several elements that affect performances, such as effective accesses to hierarchical memory system, processor occupancy, and latency hiding. In conclusion, we present several directions that may be utilized effectively in CUDA-based parallel programming.

Acceleration of Feature-Based Image Morphing Using GPU (GPU를 이용한 특징 기반 영상모핑의 가속화)

  • Kim, Eun-Ji;Yoon, Seung-Hyun;Lee, Jieun
    • Journal of the Korea Computer Graphics Society
    • /
    • v.20 no.2
    • /
    • pp.13-24
    • /
    • 2014
  • In this study, a graphics-processing-unit (GPU)-based acceleration technique is proposed for the feature-based image morphing. This technique uses the depth-buffer of the graphics hardware to calculate efficiently the shortest distance between a pixel and the control lines. The pairs of control lines between the source image and the destination image are determined by user's input, and the distance function of each control line is rendered using two rectangles and two cones. The distance between each pixel and its nearest control line is stored in the depth buffer through the graphics pipeline, and this is used to conduct the morphing operation efficiently. The pixel-unit morphing operation is parallelized using the compute unified device architecture (CUDA) to reduce the morphing time. We demonstrate the efficiency of the proposed technique using several experimental results.

Acceleration techniques for GPGPU-based Maximum Intensity Projection (GPGPU 환경에서 최대휘소투영 렌더링의 고속화 방법)

  • Kye, Hee-Won;Kim, Jun-Ho
    • Journal of Korea Multimedia Society
    • /
    • v.14 no.8
    • /
    • pp.981-991
    • /
    • 2011
  • MIP(Maximum Intensity Projection) is a volume rendering technique which is essential for the medical imaging system. MIP rendering based on the ray casting method produces high quality images but takes a long time. Our aim is improvement of the rendering speed using GPGPU(General-purpose computing on Graphic Process Unit) technique. In this paper, we present the ray casting algorithm based on CUDA(an acronym for Compute Unified Device Architecture) which is a programming language for GPGPU and we suggest new acceleration methods for CUDA. In detail, we propose the block based space leaping which skips unnecessary regions of volume data for CUDA, the bisection method which is a fast method to find a block edge, and the initial value estimation method which improves the probability of space leaping. Due to the proposed methods, we noticeably improve the rendering speed without image quality degradation.

A Development of Fusion Processor Architecture for Efficient Main Memory Access in CPU-GPU Environment (CPU-GPU환경에서 효율적인 메인메모리 접근을 위한 융합 프로세서 구조 개발)

  • Park, Hyun-Moon;Kwon, Jin-San;Hwang, Tae-Ho;Kim, Dong-Sun
    • The Journal of the Korea institute of electronic communication sciences
    • /
    • v.11 no.2
    • /
    • pp.151-158
    • /
    • 2016
  • The HSA resolves an old problem with existing CPU and GPU architectures by allowing both units to directly access each other's memory pools via unified virtual memory. In a physically realized system, however, frequent data exchanges between CPU and GPU for a virtual memory block result bottlenecks and coherence request overheads. In this paper, we propose Fusion Processor Architecture for efficient access of main memory from both CPU and GPU. It consists of Job Manager, Re-mapper, and Pre-fetcher to control, organize, and distribute work loads and working areas for GPU cores. These components help on reducing memory exchanges between the two processors and improving overall efficiency by eliminating faulty page table requests. To verify proposed algorithm architectures, we develop an emulator based on QEMU, and compare several architectures such as CUDA(Compute Unified Device Architecture), OpenMP, OpenCL. As a result, Proposed fusion processor architectures show 198% faster than others by removing unnecessary memory copies and cache-miss overheads.