• Title/Summary/Keyword: SIMD Architecture

Architecture of General and Intelligent Parallel Processing System (범용성과 지능성을 갖는 병렬 처리기 구조)

  • Lee, Hyung;Choi, Sung-Hyuk;Kim, Jung-Bae;Park, Jong-Won
    • Proceedings of the Korea Information Processing Society Conference / 2000.10a / pp.601-604 / 2000
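  • This paper proposes a parallel processing system with a Semi-MIMD architecture in order to improve the efficiency of a SIMD parallel processing system that uses Park's multi-access memory, which was proposed for processing massive amounts of image data in real time.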

Design and Implementation of a DSP Chip for Portable Multimedia Applications (휴대 멀티미디어 응용을 위한 DSP 칩 설계 및 구현)

  • 윤성현;선우명훈
    • Journal of the Korean Institute of Telematics and Electronics C / v.35C no.12 / pp.31-39 / 1998
  • This paper presents the design and implementation of a new multimedia fixed-point DSP (MDSP) core for portable multimedia applications. The MDSP instruction set was designed by analyzing multimedia algorithms and existing DSP instruction sets. The MDSP architecture employs parallel processing techniques, such as SIMD and vector processing, as well as DSP techniques. The instruction set can handle various data formats, and the MDSP can perform two MAC operations in parallel. The switching network and packing network can increase performance by overlapping data rearrangement cycles with computation cycles. We designed Verilog HDL models and used the 0.6 $\mu\textrm{m}$ Samsung KG75000 SOG library. The total gate count is 68,831 and the clock frequency is 30 MHz.

An Echo Processor for Medical Ultrasound Imaging Using a GPU with Massively Parallel Processing Architecture (병렬 처리 구조의 GPU를 이용한 의료 초음파 영상용 에코 신호 처리기)

  • Seo, Sin-Hyeok;Sohn, Hak-Yeol;Song, Tai-Kyong
    • Proceedings of the IEEK Conference / 2008.06a / pp.871-872 / 2008
  • The method and results of a software implementation of an echo processor for medical ultrasound imaging using a GPU (NVIDIA G80) are presented. The echo signal processing functions are modified in a SIMD manner suited to the GPU's massively parallel processing architecture, so that the GPU's 128 ALUs are utilized at nearly 100%. The preliminary result for an image frame composed of 128 scan lines, each having 10240 16-bit samples, shows that the echo processor can run at a high rate of 30 frames per second when implemented in C, which is close to the rate of optimized assembly code running on TI's TMS320C6416 DSP.


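The frame figures quoted in the abstract above imply an input data rate that can be worked out directly. The following is only a back-of-the-envelope sketch in Python: the constant and function names are hypothetical, and the calculation covers raw input samples only, not any intermediate or output data the processor may handle.

```python
# Figures taken from the abstract; all names here are illustrative only.
SCAN_LINES = 128            # scan lines per image frame
SAMPLES_PER_LINE = 10240    # samples per scan line
BYTES_PER_SAMPLE = 2        # 16-bit samples
FRAMES_PER_SECOND = 30      # reported frame rate


def input_rate():
    """Return (samples per frame, samples per second, bytes per second)."""
    samples_per_frame = SCAN_LINES * SAMPLES_PER_LINE
    samples_per_second = samples_per_frame * FRAMES_PER_SECOND
    bytes_per_second = samples_per_second * BYTES_PER_SAMPLE
    return samples_per_frame, samples_per_second, bytes_per_second


if __name__ == "__main__":
    per_frame, per_second, byte_rate = input_rate()
    print(f"{per_frame} samples/frame, {per_second} samples/s, {byte_rate} bytes/s")
```

Under these figures the raw input amounts to 1,310,720 samples per frame, or roughly 78.6 MB of 16-bit samples per second at 30 frames per second.
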
A Design of Reconfigurable Neural Network Processor (재구성 가능한 신경망 프로세서의 설계)

  • 장영진;이현수
    • Proceedings of the IEEK Conference / 1999.11a / pp.368-371 / 1999
  • In this paper, we propose a neural network processor architecture with on-chip learning that is reconfigurable according to the data dependencies of the algorithm applied. Depending on the neural network model applied, the proposed architecture can be configured as either SIMD or SRA (Systolic Ring Array) without any change to the on-chip configuration, so as to obtain high throughput. Changes to the system configuration, however, can be controlled by the user program. The activation function, which needs many cycles to compute, is implemented with a PWL (Piece-Wise Linear) function approximation method. This unit has only a single latency and can process non-linear functions such as the sigmoid and Gaussian functions. We verified the processing mechanism with the EBP (Error Back-Propagation) model.


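The piece-wise linear (PWL) approximation mentioned in the abstract above can be illustrated in software. This is a minimal Python sketch, not the paper's hardware design: the input range of [-8, 8], the 16 uniform segments, and the function names are all assumptions chosen for illustration.

```python
import math


def sigmoid(x):
    """Reference sigmoid used to build and check the table."""
    return 1.0 / (1.0 + math.exp(-x))


def build_pwl_table(lo=-8.0, hi=8.0, segments=16):
    """Precompute one (slope, intercept) pair per uniform segment.

    The range and segment count are assumed values, not taken from the paper.
    """
    step = (hi - lo) / segments
    table = []
    for i in range(segments):
        x0 = lo + i * step
        x1 = x0 + step
        slope = (sigmoid(x1) - sigmoid(x0)) / step
        intercept = sigmoid(x0) - slope * x0
        table.append((slope, intercept))
    return lo, hi, step, table


def pwl_sigmoid(x, pwl):
    """Evaluate the approximation: one table lookup, one multiply, one add."""
    lo, hi, step, table = pwl
    if x <= lo:
        return 0.0  # saturate below the table range
    if x >= hi:
        return 1.0  # saturate above the table range
    index = min(int((x - lo) / step), len(table) - 1)
    slope, intercept = table[index]
    return slope * x + intercept


if __name__ == "__main__":
    pwl = build_pwl_table()
    for x in (-4.0, -0.5, 0.0, 2.25):
        print(x, pwl_sigmoid(x, pwl), sigmoid(x))
```

Replacing the exponential with a lookup plus one multiply-add is what makes the evaluation cheap; a finer table trades memory for accuracy.
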
GPU Implementation Techniques of Genetic Algorithm and Comparative Studies (유전 알고리즘의 GPU 구현 기법 및 비교 연구)

  • Hyeon, Byeong-Yong;Seo, Ki-Sung
    • Journal of Institute of Control, Robotics and Systems / v.17 no.4 / pp.328-335 / 2011
  • A GPU (Graphics Processing Unit) has a SIMD (Single Instruction Multiple Data) architecture and provides fast parallel processing. A GA (Genetic Algorithm), which requires a large amount of computation, is implemented on a GPU using CUDA (Compute Unified Device Architecture). Three kinds of execution models are presented according to different combinations of processing modules on the GPU. The GPU models are compared experimentally with a CPU on a couple of benchmark problems while varying the population size and the problem complexity.

Parallel Fuzzy Information Processing System - KAFA : KAist Fuzzy Accelerator -

  • Kim, Young-Dal;Lee, Hyung-Kwang;Park, Kyu-Ho
    • Proceedings of the Korean Institute of Intelligent Systems Conference / 1993.06a / pp.981-984 / 1993
  • During the past decade, several dedicated hardware designs for fast fuzzy inference have been developed. Most of them are dedicated to a specific inference method and thus cannot support other inference methods. In this paper, we present a hardware architecture called KAFA (KAist Fuzzy Accelerator), which provides various fuzzy inference methods and fuzzy set operators. The architecture has a SIMD structure consisting of two parts: a system control/interface unit (Main Controller) and arithmetic units (FPEs). Using parallel processing technology, KAFA achieves high performance for fuzzy information processing. The speed of KAFA holds promise for the development of new fuzzy application systems.

Implementation of an Optimal Many-core Processor for Beamforming Algorithm of Mobile Ultrasound Image Signals (모바일 초음파 영상신호의 빔포밍 기법을 위한 최적의 매니코어 프로세서 구현)

  • Choi, Byong-Kook;Kim, Jong-Myon
    • Journal of the Korea Society of Computer and Information / v.16 no.8 / pp.119-128 / 2011
  • This paper introduces a design space exploration of many-core processors that meet the high performance and low power required by the beamforming algorithm for mobile ultrasound image signals. For the design space exploration, we mapped different amounts of ultrasound image data to each processing element (PE) of the many-core processor, and then determined an optimal many-core processor architecture in terms of execution time, energy efficiency, and area efficiency. Experimental results indicate that PE=4096 and PE=1024 provide the highest energy efficiency and area efficiency, respectively. In addition, PE=4096 achieves 46x better energy efficiency and 10x better area efficiency than the TI C6416 DSP, which is widely used in ultrasound imaging devices.

Architecture Exploration of Optimal Many-Core Processors for a Vector-based Rasterization Algorithm (래스터화 알고리즘을 위한 최적의 매니코어 프로세서 구조 탐색)

  • Son, Dong-Koo;Kim, Cheol-Hong;Kim, Jong-Myon
    • IEMEK Journal of Embedded Systems and Applications / v.9 no.1 / pp.17-24 / 2014
  • In this paper, we implement a vector-based rasterization algorithm for 3D graphics on a SIMD (single instruction multiple data) many-core processor architecture and evaluate its performance. In addition, we evaluate the impact of the data-per-processing-element (DPE) ratio, defined as the amount of data directly mapped to each processing element (PE) within the many-core processor, on performance, energy efficiency, and area efficiency. For the experiment, we use seven different PE configurations obtained by varying the DPE ratio (or the number of PEs), all implemented in the same 130 nm CMOS technology with a 500 MHz clock frequency. Experimental results indicate that the optimal PE configuration is achieved when the DPE ratio is in the range from 16,384 to 256 (that is, when the number of PEs is in the range from 16 to 1,024), which meets the requirements of mobile devices in terms of performance and efficiency.

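The relationship between the DPE ratio and the number of PEs in the abstract above is simple division. In the Python sketch below, the total of 262,144 data elements is not stated in the abstract; it is inferred from the two quoted endpoints (16,384 x 16 and 256 x 1,024 give the same product), and the names are hypothetical.

```python
# Inferred from the abstract's endpoints, not stated there: 16,384 * 16 = 262,144.
TOTAL_DATA = 16_384 * 16


def dpe_ratio(num_pes, total=TOTAL_DATA):
    """Data elements mapped to each PE when `total` is split evenly."""
    if num_pes <= 0 or total % num_pes != 0:
        raise ValueError("total must divide evenly among the PEs")
    return total // num_pes


if __name__ == "__main__":
    # Only the two PE counts quoted in the abstract are shown here.
    for pes in (16, 1024):
        print(f"{pes} PEs: DPE ratio {dpe_ratio(pes)}")
```

Increasing the PE count lowers the DPE ratio in proportion, which is why the abstract can describe the same optimal range either way.
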
A Design of a Vertex Shader for Mobile Devices (Mobile 기기에 적합한 Vertex Shader 의 설계 및 구현)

  • Jeong, Hyung-Ki;Nam, Ki-Hun;Lee, Kwang-Yeob;Hur, Hyun-Min;Lee, Byung-Ok;Lee, James
    • Proceedings of the IEEK Conference / 2005.11a / pp.751-754 / 2005
  • In this paper, we design a vertex shader for mobile devices. The proposed vertex shader is compatible with OpenGL ARB and DirectX 8.0 Vertex Shader 1.1 and is built on a SIMD architecture that uses a modified IEEE-754 24-bit floating-point format. Every floating-point arithmetic unit completes an operation in one cycle at a frequency of 100 MHz or more. We built a vertex shader demo system on a Xilinx Virtex-II and obtained a synthesis result of 11M gates with TSMC 0.13 um at 115 MHz.

The performance of fast view synthesis using GPU (GPU를 이용한 고속 영상 합성 기법의 성능)

  • Kim, Jaehan;Shin, Hong-Chang;Cheong, Won-Sik;Bang, Gun
    • Proceedings of the Korean Society of Broadcast Engineers Conference / 2011.07a / pp.22-24 / 2011
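  • In this paper, we propose a GPU-based fast view synthesis technique that allows a 3D display system to generate multiple intermediate-view images in real time, and we examine its performance. Intermediate-view images are generated using the geometric information of the cameras and the depth information of the reference images, and the view synthesis method is accelerated by processing it in parallel on the GPU. To use the GPU efficiently, we use NVIDIA's CUDA™ (Compute Unified Device Architecture). The proposed technique is designed to perform intermediate view synthesis using the SIMD (Single Instruction Multiple Data) structure of CUDA. This paper focuses on fast view synthesis and, by analyzing the results of the proposed acceleration technique, examines its applicability to multi-view 3D display systems.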
