• Title/Summary/Keyword: Multi-GPU

Analysis of Job Scheduling and the Efficiency for Multi-core Mobile GPU (멀티코어형 모바일 GPU의 작업 분배 및 효율성 분석)

  • Lim, Hyojeong;Han, Donggeon;Kim, Hyungshin
    • Journal of the Korea Academia-Industrial cooperation Society / v.15 no.7 / pp.4545-4553 / 2014
  • Mobile GPUs have driven the rapid development of smartphone graphics technology, and most recent smartphones are equipped with a high-performance multi-core GPU. How efficiently a multi-core mobile GPU can be utilized is therefore a critical issue for improving smartphone performance. However, most current research has focused on single-core mobile GPUs, and studies of multi-core mobile GPUs are rare. In this paper, the job scheduling patterns and the efficiency of a multi-core mobile GPU are analyzed. The profiling results show that, despite the larger number of GPU cores, the total processing time for certain graphics applications increased. In addition, when the GPU processes 3D games, substantial overhead is caused by communication not only between the CPU and the GPU but also among the GPU cores. These results confirm that more active research on multi-core mobile GPUs is needed to optimize present mobile GPUs.

High-Performance Multi-GPU Rendering Based on Implicit Synchronization (묵시적 동기화 기반의 고성능 다중 GPU 렌더링)

  • Kim, Younguk;Lee, Sungkil
    • Journal of KIISE / v.42 no.11 / pp.1332-1338 / 2015
  • Recently, growing attention has been paid to multi-GPU rendering to support real-time, high-quality rendering at high resolution. To attain high performance in real-time multi-GPU rendering, great care needs to be taken to reduce the overhead of data transfer among GPUs and of frame composition. This paper presents a novel multi-GPU algorithm that greatly enhances split frame rendering with implicit query-based synchronization. To support implicit synchronization in frame composition, we further present a message queue-based scheduling algorithm. Experiments show that our algorithm improves rendering performance by up to 200% over previously existing algorithms.
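
As a rough illustration of the data flow in split frame rendering, the sketch below (not the authors' implementation, which works through graphics-API query objects and a message-queue scheduler) lets two GPUs each produce half of a frame, gathers the second half onto GPU 0 with an asynchronous peer copy, and makes the composition stream wait on a CUDA event rather than a blocking host synchronization. The kernel `renderHalf` and the buffer names are placeholders.

```cuda
#include <cuda_runtime.h>

// Stand-in for the per-GPU rendering pass: fills this GPU's half of the frame.
__global__ void renderHalf(float* half, int pixels, float value) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < pixels) half[i] = value;
}

// d_frame0: full frame buffer on GPU 0, d_half1: half-frame buffer on GPU 1.
void renderFrame(float* d_frame0, float* d_half1, int halfPixels) {
  cudaStream_t s0, s1;
  cudaEvent_t half1Done;

  cudaSetDevice(0);  cudaStreamCreate(&s0);
  cudaSetDevice(1);  cudaStreamCreate(&s1);
  cudaEventCreateWithFlags(&half1Done, cudaEventDisableTiming);

  // GPU 0 renders the top half directly into the final frame buffer.
  cudaSetDevice(0);
  renderHalf<<<(halfPixels + 255) / 256, 256, 0, s0>>>(d_frame0, halfPixels, 0.25f);

  // GPU 1 renders the bottom half, then pushes it to GPU 0 asynchronously.
  cudaSetDevice(1);
  renderHalf<<<(halfPixels + 255) / 256, 256, 0, s1>>>(d_half1, halfPixels, 0.75f);
  cudaMemcpyPeerAsync(d_frame0 + halfPixels, 0, d_half1, 1,
                      halfPixels * sizeof(float), s1);
  cudaEventRecord(half1Done, s1);

  // GPU 0's composition waits on the event instead of blocking the host.
  cudaSetDevice(0);
  cudaStreamWaitEvent(s0, half1Done, 0);
  // ... composition / post-processing kernel would be launched on s0 here ...
  cudaStreamSynchronize(s0);

  cudaStreamDestroy(s0);
  cudaSetDevice(1);  cudaStreamDestroy(s1);
  cudaEventDestroy(half1Done);
}
```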

Multi GPU Based Image Registration for Cerebrovascular Extraction and Interactive Visualization (뇌혈관 추출과 대화형 가시화를 위한 다중 GPU기반 영상정합)

  • Park, Seong-Jin;Shin, Yeong-Gil
    • Journal of KIISE:Computing Practices and Letters / v.15 no.6 / pp.445-449 / 2009
  • In this paper, we propose a computationally efficient multi-GPU-accelerated image registration technique that corrects the motion difference between a pre-contrast CT image and a post-contrast CTA image. Our method consists of two steps: multi-GPU-based image registration and cerebrovascular visualization. First, it computes a similarity measure for voxel-based registration by exploiting parallelism both across the GPUs and within each GPU. Then, it subtracts the CT image, transformed by the optimal transformation matrix, from the CTA image and visualizes the subtracted volume using a GPU-based volume rendering technique. We compare our method with existing methods on 5 pairs of pre-contrast brain CT and post-contrast brain CTA images with respect to visual quality and computation time. Experimental results show that our method clearly visualizes the cerebral vessels and thus supports the diagnosis of vessel disease. For the total processing, our multi-GPU approach is 11.6 times faster than a CPU-based approach and 1.4 times faster than a single-GPU approach.
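
A minimal sketch of the first step, assuming sum of squared differences as the similarity measure (the abstract does not name the measure) and two GPUs each handling half of the volume. A production version would drive the GPUs from separate host threads or with asynchronous copies so both halves are computed concurrently; the function and variable names here are illustrative only.

```cuda
#include <cuda_runtime.h>

// Each thread squares the difference of one voxel pair; a block-level
// reduction accumulates into the per-GPU partial sum.
__global__ void ssdKernel(const float* ct, const float* cta, float* partial, size_t n) {
  __shared__ float cache[256];
  size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
  float d = 0.0f;
  if (i < n) { float diff = ct[i] - cta[i]; d = diff * diff; }
  cache[threadIdx.x] = d;
  __syncthreads();
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s) cache[threadIdx.x] += cache[threadIdx.x + s];
    __syncthreads();
  }
  if (threadIdx.x == 0) atomicAdd(partial, cache[0]);
}

// Splits the volume in half, computes a partial SSD on each GPU, and sums
// the two results on the host.
float ssdTwoGPUs(const float* h_ct, const float* h_cta, size_t n) {
  float total = 0.0f;
  size_t half = n / 2;
  for (int g = 0; g < 2; ++g) {
    cudaSetDevice(g);
    size_t off = g * half, len = (g == 0) ? half : n - half;
    float *d_ct, *d_cta, *d_partial, h_partial = 0.0f;
    cudaMalloc(&d_ct, len * sizeof(float));
    cudaMalloc(&d_cta, len * sizeof(float));
    cudaMalloc(&d_partial, sizeof(float));
    cudaMemcpy(d_ct, h_ct + off, len * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_cta, h_cta + off, len * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_partial, 0, sizeof(float));
    ssdKernel<<<(unsigned)((len + 255) / 256), 256>>>(d_ct, d_cta, d_partial, len);
    cudaMemcpy(&h_partial, d_partial, sizeof(float), cudaMemcpyDeviceToHost);
    total += h_partial;
    cudaFree(d_ct);  cudaFree(d_cta);  cudaFree(d_partial);
  }
  return total;
}
```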

Empirical Performance Evaluation of Communication Libraries for Multi-GPU based Distributed Deep Learning in a Container Environment

  • Choi, HyeonSeong;Kim, Youngrang;Lee, Jaehwan;Kim, Yoonhee
    • KSII Transactions on Internet and Information Systems (TIIS) / v.15 no.3 / pp.911-931 / 2021
  • Recently, most cloud services use the Docker container environment to provide their services. However, there has been little research evaluating the performance of communication libraries for multi-GPU based distributed deep learning in a Docker container environment. In this paper, we propose an efficient communication architecture for multi-GPU based deep learning in a Docker container environment by evaluating the performance of various communication libraries. We compare the parameter server architecture and the All-reduce architecture, which are typical distributed deep learning architectures. Further, we analyze two multi-GPU resource allocation policies: allocating a single GPU to each Docker container and allocating multiple GPUs to each Docker container. We also examine the scalability of collective communication by increasing the number of GPUs from one to four. Through experiments, we compare OpenMPI and MPICH, which are representative open-source MPI libraries, and NCCL, NVIDIA's collective communication library for multi-GPU settings. In the parameter server architecture, using CUDA-aware OpenMPI with multiple GPUs per Docker container reduces communication latency by up to 75%. In the All-reduce architecture, using NCCL reduces communication latency by up to 93% compared to the other libraries.
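
For reference, the single-process, multi-GPU all-reduce pattern that NCCL is compared on can be exercised with a few calls; the sketch below is a generic example of that pattern (buffer size and contents are arbitrary), not the benchmark code used in the paper.

```cuda
#include <nccl.h>
#include <cuda_runtime.h>

int main() {
  int nDev = 0;
  cudaGetDeviceCount(&nDev);
  if (nDev < 1) return 1;
  if (nDev > 8) nDev = 8;

  ncclComm_t comms[8];
  cudaStream_t streams[8];
  float *sendbuf[8], *recvbuf[8];
  const size_t count = 1 << 20;   // floats per GPU, e.g. one gradient tensor
  int devs[8];
  for (int i = 0; i < nDev; ++i) devs[i] = i;

  // One communicator per GPU inside a single process.
  ncclCommInitAll(comms, nDev, devs);

  for (int i = 0; i < nDev; ++i) {
    cudaSetDevice(i);
    cudaStreamCreate(&streams[i]);
    cudaMalloc((void**)&sendbuf[i], count * sizeof(float));
    cudaMalloc((void**)&recvbuf[i], count * sizeof(float));
    cudaMemset(sendbuf[i], 1, count * sizeof(float));   // dummy data
  }

  // All-reduce (sum) of the per-GPU buffers, as in gradient aggregation.
  ncclGroupStart();
  for (int i = 0; i < nDev; ++i)
    ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                  comms[i], streams[i]);
  ncclGroupEnd();

  for (int i = 0; i < nDev; ++i) {
    cudaSetDevice(i);
    cudaStreamSynchronize(streams[i]);
    cudaFree(sendbuf[i]);  cudaFree(recvbuf[i]);
    ncclCommDestroy(comms[i]);
  }
  return 0;
}
```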

Parallel Range Query Processing with R-tree on Multi-GPUs (다중 GPU를 이용한 R-tree의 병렬 범위 질의 처리 기법)

  • Ryu, Hongsu;Kim, Mincheol;Choi, Wonik
    • Journal of KIISE / v.42 no.4 / pp.522-529 / 2015
  • Ever since the R-tree was proposed to index multi-dimensional data, many efforts have been made to improve its query performance. One common approach is to parallelize query processing on multi-core architectures, and a GPU-based R-tree has recently been proposed to this end. However, even though a GPU-based R-tree improves query performance, its ability to handle large volumes of data is limited because GPUs have limited physical memory. To address this problem, we propose the MGR-tree (Multi-GPU R-tree), which can manage large volumes of data by distributing nodes across multiple GPUs. Our experiments show that the MGR-tree is up to 9.1 times faster than a sequential search on a GPU and up to 1.6 times faster than a conventional GPU-based R-tree.
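
The MGR-tree itself traverses a hierarchical index; the sketch below only illustrates the underlying idea of partitioning entries across GPUs and running the same range query on every partition, using a flat array of bounding rectangles rather than R-tree nodes. Types and buffer-management details are assumptions.

```cuda
#include <cuda_runtime.h>

// Axis-aligned 2-D bounding rectangle (hypothetical layout for this sketch).
struct Rect { float xmin, ymin, xmax, ymax; };

// Marks every rectangle in this GPU's partition that overlaps the query window.
__global__ void rangeQuery(const Rect* rects, int n, Rect q, int* hit) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  Rect r = rects[i];
  hit[i] = (r.xmin <= q.xmax && q.xmin <= r.xmax &&
            r.ymin <= q.ymax && q.ymin <= r.ymax) ? 1 : 0;
}

// Launches the query on each GPU over its own pre-loaded partition; kernel
// launches are asynchronous, so the GPUs work concurrently.
void queryAllGPUs(Rect** d_rects, int** d_hit, const int* counts, int nGpu, Rect q) {
  for (int g = 0; g < nGpu; ++g) {
    cudaSetDevice(g);
    int n = counts[g];
    rangeQuery<<<(n + 255) / 256, 256>>>(d_rects[g], n, q, d_hit[g]);
  }
  for (int g = 0; g < nGpu; ++g) { cudaSetDevice(g); cudaDeviceSynchronize(); }
}
```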

A Performance Enhancement of a Naval Multi-Function Radar Signal Processor (GPU를 이용한 함정용 다기능레이다 신호처리기 성능 개선 연구)

  • Kwon, Se-Woong;Hong, Sung-Min;Ryu, Seong-Hyun;Jung, Chae-Hyun;Sohn, Sung-Hwan;Lee, Ki-Won;Kang, Yeon-Duk
    • The Journal of the Institute of Internet, Broadcasting and Communication / v.20 no.2 / pp.141-147 / 2020
  • We studied a GPU-based signal processor for a naval multi-function radar. We implemented the processing software on both a DSP and a GPU and compared their computational performance and power consumption. The GPU improved computational performance by a factor of 1.2 to 4.1 over the DSP. These results indicate that a GPU can replace a DSP-based signal processor as a common radar processor, even for a naval multi-function radar.

Fast Generation of Digital Hologram Based on Multi-GPU (Multi-GPU 기반의 고속 디지털 홀로그램 생성)

  • Song, Joong-Seok;Park, Jung-Sik;Seo, Young-Ho;Park, Jong-Il
    • Journal of Broadcast Engineering / v.16 no.6 / pp.1009-1017 / 2011
  • Fast generation of digital holograms is important for real-time holography broadcasting. In this paper, we propose a method that parallelizes the computer-generated holography (CGH) algorithm for digital hologram generation and accelerates it on multiple graphics processing units (multi-GPU) with the help of the Compute Unified Device Architecture (CUDA) and Open Multi-Processing (OpenMP). In addition, we apply optimizations such as variable fixation, vectorization, and loop unrolling to make the CGH algorithm much faster. Experimental results show that our method is about 9,700 times faster than a CPU-based one.
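
A minimal sketch of the multi-GPU/OpenMP partitioning described above: one OpenMP thread drives each GPU, and each GPU computes a horizontal band of the hologram plane by summing point-source contributions. The fringe formula is a simplified stand-in, and the paper's additional optimizations (variable fixation, vectorization, loop unrolling) are not reproduced.

```cuda
#include <cuda_runtime.h>
#include <omp.h>
#include <vector>

struct Point3D { float x, y, z, amp; };

// Accumulates the fringe contribution of every object point at one hologram pixel.
__global__ void cghKernel(const Point3D* pts, int nPts, float* holo,
                          int width, int rowOffset, int rows, float k) {
  int px = blockIdx.x * blockDim.x + threadIdx.x;
  int py = blockIdx.y * blockDim.y + threadIdx.y;
  if (px >= width || py >= rows) return;
  float y = (float)(py + rowOffset);
  float sum = 0.0f;
  for (int i = 0; i < nPts; ++i) {
    float dx = px - pts[i].x, dy = y - pts[i].y, dz = pts[i].z;
    float r = sqrtf(dx * dx + dy * dy + dz * dz);
    sum += pts[i].amp * cosf(k * r);
  }
  holo[py * width + px] = sum;
}

// One OpenMP thread per GPU; every GPU computes its own horizontal band.
void generateHologram(const std::vector<Point3D>& pts, float* h_holo,
                      int width, int height, float k) {
  int nGpu = 0;
  cudaGetDeviceCount(&nGpu);
  if (nGpu < 1) return;
  int band = height / nGpu;
  #pragma omp parallel for num_threads(nGpu)
  for (int g = 0; g < nGpu; ++g) {
    cudaSetDevice(g);
    int rowOffset = g * band;
    int rows = (g == nGpu - 1) ? height - rowOffset : band;
    Point3D* d_pts;  float* d_holo;
    cudaMalloc(&d_pts, pts.size() * sizeof(Point3D));
    cudaMalloc(&d_holo, (size_t)rows * width * sizeof(float));
    cudaMemcpy(d_pts, pts.data(), pts.size() * sizeof(Point3D), cudaMemcpyHostToDevice);
    dim3 block(16, 16), grid((width + 15) / 16, (rows + 15) / 16);
    cghKernel<<<grid, block>>>(d_pts, (int)pts.size(), d_holo, width, rowOffset, rows, k);
    cudaMemcpy(h_holo + (size_t)rowOffset * width, d_holo,
               (size_t)rows * width * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_pts);  cudaFree(d_holo);
  }
}
```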

Efficient Implementing of DNA Computing-inspired Pattern Classifier Using GPU (GPU를 이용한 DNA 컴퓨팅 기반 패턴 분류기의 효율적 구현)

  • Choi, Sun-Wook;Lee, Chong-Ho
    • The Transactions of The Korean Institute of Electrical Engineers / v.58 no.7 / pp.1424-1434 / 2009
  • DNA computing-inspired pattern classification based on the hypernetwork model is a novel approach to pattern classification problems. The hypernetwork model has been shown to be a powerful tool for multi-class data analysis. However, the ordinary hypernetwork model is limited in that it operates only sequentially. In this paper, we propose an efficient implementation of a DNA computing-inspired pattern classifier using a GPU. For performance evaluation, we present multi-class classification results on handwritten digit data, DNA microarray data, and an eight-category scene dataset, and we compare the running time of the proposed classifier in CPU and GPU environments. The experiments show classification results competitive with conventional machine learning algorithms. We confirm that the proposed DNA computing-inspired pattern classifier, implemented on a GPU using the CUDA platform, is suitable for multi-class data classification, and that it runs fast enough for point-of-care diagnosis, real-time scene categorization, and handwritten digit classification.
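
A hedged sketch of how such a classifier can be parallelized on a GPU: one thread evaluates one hyperedge against the query pattern and votes for its class. The hyperedge layout (ORDER, exact-match comparison, NUM_CLASSES) is assumed for illustration and does not reproduce the authors' hypernetwork model.

```cuda
#include <cuda_runtime.h>

#define ORDER 4         // (index, value) pairs per hyperedge -- assumed
#define NUM_CLASSES 10  // e.g. handwritten digits -- assumed

// One hyperedge: a small set of feature positions/values plus a class label.
struct Hyperedge { int idx[ORDER]; float val[ORDER]; int label; };

// One thread checks whether one hyperedge matches the query pattern and,
// if so, votes for its class with an atomic increment.
__global__ void classify(const Hyperedge* edges, int nEdges,
                         const float* query, int* votes) {
  int e = blockIdx.x * blockDim.x + threadIdx.x;
  if (e >= nEdges) return;
  Hyperedge h = edges[e];
  for (int k = 0; k < ORDER; ++k)
    if (query[h.idx[k]] != h.val[k]) return;   // mismatch: no vote
  atomicAdd(&votes[h.label], 1);
}

// Host side: launch the voting kernel and pick the class with the most votes.
int predict(const Hyperedge* d_edges, int nEdges, const float* d_query) {
  int *d_votes, h_votes[NUM_CLASSES];
  cudaMalloc(&d_votes, NUM_CLASSES * sizeof(int));
  cudaMemset(d_votes, 0, NUM_CLASSES * sizeof(int));
  classify<<<(nEdges + 255) / 256, 256>>>(d_edges, nEdges, d_query, d_votes);
  cudaMemcpy(h_votes, d_votes, NUM_CLASSES * sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(d_votes);
  int best = 0;
  for (int c = 1; c < NUM_CLASSES; ++c)
    if (h_votes[c] > h_votes[best]) best = c;
  return best;
}
```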

Fast and Efficient Implementation of Neural Networks using CUDA and OpenMP (CUDA와 OPenMP를 이용한 빠르고 효율적인 신경망 구현)

  • Park, An-Jin;Jang, Hong-Hoon;Jung, Kee-Chul
    • Journal of KIISE:Software and Applications / v.36 no.4 / pp.253-260 / 2009
  • Many algorithms for computer vision and pattern recognition have recently been implemented on the GPU (graphics processing unit) for faster computation. However, such implementations have two problems. First, the programmer must master the fundamentals of graphics shading languages, which requires prior knowledge of computer graphics. Second, in jobs that need close cooperation between the CPU and the GPU, which is common in image processing and pattern recognition in contrast to pure graphics work, the CPU should generate raw feature data for GPU processing as fast as possible to utilize the GPU effectively. This paper proposes a faster and more efficient implementation of neural networks on both the GPU and a multi-core CPU. To solve the first problem, we use CUDA (Compute Unified Device Architecture), which can be programmed easily in a simple C-like style instead of a shading language. Moreover, OpenMP (Open Multi-Processing) is used to concurrently process multiple data with a single instruction on the multi-core CPU, which keeps the GPU's memory effectively utilized. In the experiments, we implemented a neural network-based text extraction system using the proposed architecture, and it ran about 15 times faster than an implementation on the GPU alone without OpenMP.
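
A minimal sketch of the CPU/GPU cooperation described above, assuming a double-buffered pipeline: OpenMP threads fill one pinned feature buffer while the GPU consumes the previous one through an asynchronous stream. The kernel `forwardKernel` is a placeholder for the network's forward pass, not the paper's text-extraction model.

```cuda
#include <cuda_runtime.h>
#include <omp.h>

// Placeholder for one network layer: elementwise y = max(0, x).
__global__ void forwardKernel(const float* in, float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = fmaxf(in[i], 0.0f);
}

// CPU side: OpenMP fills one feature batch while the GPU works on the other.
void pipeline(int nBatches, int batchSize) {
  float *h_buf[2], *d_in, *d_out;
  cudaStream_t stream;
  cudaStreamCreate(&stream);
  for (int b = 0; b < 2; ++b) cudaMallocHost(&h_buf[b], batchSize * sizeof(float));
  cudaMalloc(&d_in, batchSize * sizeof(float));
  cudaMalloc(&d_out, batchSize * sizeof(float));

  for (int b = 0; b < nBatches; ++b) {
    float* cur = h_buf[b % 2];
    // Feature generation on the multi-core CPU (dummy values here).
    #pragma omp parallel for
    for (int i = 0; i < batchSize; ++i) cur[i] = (float)(i % 255) / 255.0f;

    // Wait for the previous batch before reusing the device buffers, then
    // issue the next asynchronous copy + kernel and loop back to CPU work.
    cudaStreamSynchronize(stream);
    cudaMemcpyAsync(d_in, cur, batchSize * sizeof(float), cudaMemcpyHostToDevice, stream);
    forwardKernel<<<(batchSize + 255) / 256, 256, 0, stream>>>(d_in, d_out, batchSize);
  }
  cudaStreamSynchronize(stream);
  cudaFree(d_in);  cudaFree(d_out);
  for (int b = 0; b < 2; ++b) cudaFreeHost(h_buf[b]);
  cudaStreamDestroy(stream);
}
```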

Empirical Experiments for Convolution Layer Optimization on Multi-GPUs (Multi-GPU 환경에서의 Convolution Layer 최적화 실험)

  • Jiwon Ha;Theodora Adufu;Yoonhee Kim
    • Proceedings of the Korea Information Processing Society Conference / 2023.05a / pp.11-12 / 2023
  • As ML models in GPGPU environments continue to be applied to diverse fields, research on image segmentation is active. Parallelization techniques are used to optimize performance in multi-GPU environments. In this study, we conducted experiments that apply convolution optimization techniques to shorten the total execution time of the U-Net model in a multi-GPU environment, and achieved an 82% performance improvement by applying shared memory and data parallelism.
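
A generic sketch of the shared-memory part of such a convolution optimization: each thread block stages an input tile plus its halo in shared memory before computing a 3x3 convolution. The kernel size and tile size are assumptions; this is not the authors' U-Net code, and the data-parallel split across GPUs is omitted.

```cuda
#include <cuda_runtime.h>

#define TILE 16       // output tile computed by one thread block
#define K    3        // assumed 3x3 kernel, as in typical U-Net convolutions
#define R    (K / 2)  // halo radius

// Filter weights, uploaded from the host with cudaMemcpyToSymbol.
__constant__ float d_filter[K * K];

// Launch with dim3 block(TILE, TILE) and a grid covering the image.
__global__ void conv2dShared(const float* in, float* out, int width, int height) {
  __shared__ float tile[TILE + 2 * R][TILE + 2 * R];

  int bx = blockIdx.x * TILE, by = blockIdx.y * TILE;
  int ox = bx + threadIdx.x, oy = by + threadIdx.y;   // output pixel

  // Cooperative load of the tile plus halo (clamped at the image borders).
  for (int dy = threadIdx.y; dy < TILE + 2 * R; dy += blockDim.y)
    for (int dx = threadIdx.x; dx < TILE + 2 * R; dx += blockDim.x) {
      int ix = min(max(bx + dx - R, 0), width - 1);
      int iy = min(max(by + dy - R, 0), height - 1);
      tile[dy][dx] = in[iy * width + ix];
    }
  __syncthreads();

  if (ox >= width || oy >= height) return;
  float acc = 0.0f;
  for (int ky = 0; ky < K; ++ky)
    for (int kx = 0; kx < K; ++kx)
      acc += d_filter[ky * K + kx] * tile[threadIdx.y + ky][threadIdx.x + kx];
  out[oy * width + ox] = acc;
}
```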