• Title/Summary/Keyword: Embedded GPGPU

Search Result 11, Processing Time 0.022 seconds

Performance Enhancement and Evaluation of AES Cryptography using OpenCL on Embedded GPGPU (OpenCL을 이용한 임베디드 GPGPU환경에서의 AES 암호화 성능 개선과 평가)

  • Lee, Minhak;Kang, Woochul
    • KIISE Transactions on Computing Practices
    • /
    • v.22 no.7
    • /
    • pp.303-309
    • /
    • 2016
  • Recently, an increasing number of embedded processors such as ARM Mali begin to support GPGPU programming frameworks, such as OpenCL. Thus, GPGPU technologies that have been used in PC and server environments are beginning to be applied to the embedded systems. However, many embedded systems have different architectural characteristics compare to traditional PCs and low-power consumption and real-time performance are also important performance metrics in these systems. In this paper, we implement a parallel AES cryptographic algorithm for a modern embedded GPU using OpenCL, a standard parallel computing framework, and compare performance against various baselines. Experimental results show that the parallel GPU AES implementation can reduce the response time by about 1/150 and the energy consumption by approximately 1/290 compare to OpenMP implementation when 1000KB input data is applied. Furthermore, an additional 100 % performance improvement of the parallel AES algorithm was achieved by exploiting the characteristics of embedded GPUs such as removing copying data between GPU and host memory. Our results also demonstrate that higher performance improvement can be achieved with larger size of input data.

스마트폰에서의 영상처리를 위한 GPU 활용

  • Park, In-Gyu;Choe, Ho-Yeol
    • Information and Communications Magazine
    • /
    • v.29 no.4
    • /
    • pp.46-51
    • /
    • 2012
  • 본 기고에서는 최근 스마트폰에서 요구되는 다양한 멀티미디어 어플리케이션을 embedded GPU(Graphics Processing Unit)를 이용하여 고속 병렬처리하기 위한 GPGPU (General-Purpose Computing on GPU) 기술 및 영상처리 분야의 응용 사례를 소개한다. 일반적인 데스크탑 컴퓨팅 환경과 달리 제약사항이 많은 embedded 환경에서의 GPGPU 응용 기술은 아직 초기단계이다. 그러나 급격히 발전하는 embedded GPU IP와 OpenCL과 같은 API의 등장으로 embedded GPU를 이용한 고속 병렬처리 환경이 수 년 이내에 일반화 될 것이다. 본 기고에서는 그 가능성을 점검하기 위하여 embedded GPU에서의 영상처리를 위한 최신 하드웨어와 소프트웨어 환경의 발전 동향을 소개한다. 더불어 최신 스마트폰에서의 GPGPU기술을 사용한 영상처리 사례와 영상처리 알고리즘의 GPGPU 알고리즘 구현시 고려해야 할 주요 사항을 정리한다.

Implementation and Performance Evaluation of the Faddev-Leverrier Algorithm using GPGPU (GPGPU를 이용한 파데브-레브리어 알고리즘 구현 및 성능 분석)

  • Park, Yong-Hun;Kim, Cheol-Hong;Kim, Jong-Myon
    • IEMEK Journal of Embedded Systems and Applications
    • /
    • v.8 no.3
    • /
    • pp.171-178
    • /
    • 2013
  • In this paper, we implement the Faddev-Leverier algorithm using GPGPU (General-Purpose Graphics Processing Unit) to accelerate singular value decomposition. In addition, we compare the performance of the algorithm using CPU and CPU plus GPGPU for eleven ${\times}n$ matrix sizes in order to decompose singular values, where =4, 8, 16, 32, 64, 128, 256, 512, 1,024, 2,048, and 4,096. Experimental results indicate that CPU achieves better performance than CPU plus GPGPU for $n{\leq}64$ because of a large number of read and write operations between CPU and GPGPU. However, CPU plus GPGPU outperforms CPU exponentially in the execution time for $n{\geq}64$.

Parallelization of Feature Detection and Panorama Image Generation using OpenCL and Embedded GPU (OpenCL 및 Embedded GPU를 이용한 영상 특징 추출 및 파노라마 영상 생성의 병렬화)

  • Kang, Seung Heon;Lee, Seung-Jae;Lee, Man Hee;Park, In Kyu
    • Journal of Broadcast Engineering
    • /
    • v.19 no.3
    • /
    • pp.316-328
    • /
    • 2014
  • In this paper, we parallelize the popular feature detection algorithms, i.e. SIFT and SURF, and its application to fast panoramic image generation on the latest embedded GPU. Parallelized algorithms are implemented using recently developed OpenCL as the embedded GPGPU software platform. We compare the implementation efficiency and speed performance of conventional OpenGL Shading Language and OpenCL. Experimental result shows that implementation on OpenCL has comparable performance with GLSL. Compared with the performance on the embedded CPU in the same application processor, the embedded GPU runs 3~4 times faster. As an example of using feature extraction, panorama image synthesis is performed on embedded GPU by applying image matching using detected features.

An Implementation of a Convolutional Accelerator based on a GPGPU for a Deep Learning (Deep Learning을 위한 GPGPU 기반 Convolution 가속기 구현)

  • Jeon, Hee-Kyeong;Lee, Kwang-yeob;Kim, Chi-yong
    • Journal of IKEEE
    • /
    • v.20 no.3
    • /
    • pp.303-306
    • /
    • 2016
  • In this paper, we propose a method to accelerate convolutional neural network by utilizing a GPGPU. Convolutional neural network is a sort of the neural network learning features of images. Convolutional neural network is suitable for the image processing required to learn a lot of data such as images. The convolutional layer of the conventional CNN required a large number of multiplications and it is difficult to operate in the real-time on the embedded environment. In this paper, we reduce the number of multiplications through Winograd convolution operation and perform parallel processing of the convolution by utilizing SIMT-based GPGPU. The experiment was conducted using ModelSim and TestDrive, and the experimental results showed that the processing time was improved by about 17%, compared to the conventional convolution.

Efficient Thread Allocation Method of Convolutional Neural Network based on GPGPU (GPGPU 기반 Convolutional Neural Network의 효율적인 스레드 할당 기법)

  • Kim, Mincheol;Lee, Kwangyeob
    • Asia-pacific Journal of Multimedia Services Convergent with Art, Humanities, and Sociology
    • /
    • v.7 no.10
    • /
    • pp.935-943
    • /
    • 2017
  • CNN (Convolution neural network), which is used for image classification and speech recognition among neural networks learning based on positive data, has been continuously developed to have a high performance structure to date. There are many difficulties to utilize in an embedded system with limited resources. Therefore, we use GPU (General-Purpose Computing on Graphics Processing Units), which is used for general-purpose operation of GPU to solve the problem because we use pre-learned weights but there are still limitations. Since CNN performs simple and iterative operations, the computation speed varies greatly depending on the thread allocation and utilization method in the Single Instruction Multiple Thread (SIMT) based GPGPU. To solve this problem, there is a thread that needs to be relaxed when performing Convolution and Pooling operations with threads. The remaining threads have increased the operation speed by using the method used in the following feature maps and kernel calculations.

Implementation of Integrated CPU-GPU for Efficient Uniform Memory Access Method and Verification System (CPU-GPU간 긴밀성을 위한 효율적인 공유메모리 접근 방법과 검증 시스템 구현)

  • Park, Hyun-moon;Kwon, Jinsan;Hwang, Tae-ho;Kim, Dong-Sun
    • IEMEK Journal of Embedded Systems and Applications
    • /
    • v.11 no.2
    • /
    • pp.57-65
    • /
    • 2016
  • In this paper, we propose a system for efficient use of shared memory between CPU and GPU. The system, called Fusion Architecture, assures consistency of the shared memory and minimizes cache misses that frequently occurs on Heterogeneous System Architecture or Unified Virtual Memory based systems. It also maximizes the performance for memory intensive jobs by efficient allocation of GPU cores. To test between architectures on various scenarios, we introduce the Fusion Architecture Analyzer, which compares OpenMP, OpenCL, CUDA, and the proposed architecture in terms of memory overhead and process time. As a result, Proposed fusion architectures show that the Fusion Architecture runs benchmarks 55% faster and reduces memory overheads by 220% in average.

Fast Medical Volume Decompression Using GPGPU (GPGPU를 이용한 고속 의료 볼륨 영상의 압축 복원)

  • Kye, Hee-Won
    • Journal of Korea Multimedia Society
    • /
    • v.15 no.5
    • /
    • pp.624-631
    • /
    • 2012
  • For many medical imaging systems, volume datasets are stored as a compressed form, so that the dataset has to be decompressed before it is visualized. Since the decompression process takes quite a long time, we present an acceleration method for medical volume decompression using GPU. Our method supports that both lossy and lossless compression and progressive refinement is possible to satisfy variable user requirements. Moreover, our decompression method is well parallelized for GPU so that the decompression takes a very short time. Finally, we designed that the decompression and volume rendering work in one framework so that the selective decompression is available. As a result, we gained additional improvement in volume decompression.

Analysis of Implementing Mobile Heterogeneous Computing for Image Sequence Processing

  • BAEK, Aram;LEE, Kangwoon;KIM, Jae-Gon;CHOI, Haechul
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.11 no.10
    • /
    • pp.4948-4967
    • /
    • 2017
  • On mobile devices, image sequences are widely used for multimedia applications such as computer vision, video enhancement, and augmented reality. However, the real-time processing of mobile devices is still a challenge because of constraints and demands for higher resolution images. Recently, heterogeneous computing methods that utilize both a central processing unit (CPU) and a graphics processing unit (GPU) have been researched to accelerate the image sequence processing. This paper deals with various optimizing techniques such as parallel processing by the CPU and GPU, distributed processing on the CPU, frame buffer object, and double buffering for parallel and/or distributed tasks. Using the optimizing techniques both individually and combined, several heterogeneous computing structures were implemented and their effectiveness were analyzed. The experimental results show that the heterogeneous computing facilitates executions up to 3.5 times faster than CPU-only processing.

A GPU scheduling framework for applications based on dataflow specification (데이터 플로우 기반 응용들을 위한 GPU 스케줄링 프레임워크)

  • Lee, Yongbin;Kim, Sungchan
    • Journal of Korea Multimedia Society
    • /
    • v.17 no.10
    • /
    • pp.1189-1197
    • /
    • 2014
  • Recently, general purpose graphic processing units(GPUs) are being widely used in mobile embedded systems such as smart phone and tablet PCs. Because of architectural limitations of mobile GPGPUs, only a single program is allowed to occupy a GPU at a time in a non-preemptive way. As a result, it is difficult to meet performance requirements of applications such as frame rate or response time if applications running on a GPU are not scheduled properly. To tackle this difficulty, we propose to specify applications using synchronous data flow model of computation such that applications are formed with edges and nodes. Then nodes of applications are scheduled onto a GPU unlike conventional scheduling an application as a whole. This approach allows applications to share a GPU at a finer granularity, node (or task)-level, providing several benefits such as eliminating need for manually partitioning applications and better GPU utilization. Furthermore, any scheduling policy can be applied in response to the characteristics of applications.