• Title/Summary/Keyword: 프로세서 코어

Search Result 312, Processing Time 0.024 seconds

A Study on Multiplier Architectures Optimized for 32-bit RISC Processor with 3-Stage Pipeline (32비트 3단 파이프라인을 가진 RISC 프로세서에 최적화된 Multiplier 구조에 관한 연구)

  • 정근영;박주성;김석찬
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.41 no.11
    • /
    • pp.123-130
    • /
    • 2004
  • This paper describes a multiplier architecture optimized for 32 bit RISC processor with 3-stage pipeline. The multiplier of ARM7, the target processor, is variably carried out on the execution stage of pipeline within 7 cycles. The included multiplier employs a modified Booth's algerian to produce 64 bit multiplication and addition product and it has 6 separate instructions. We analyzed several multiplication algorithm such as radix4-32${\times}$8, radix4-32${\times}$16 and radix8-32${\times}$32 to decide which multiplication architecture is most fit for a typical architecture of ARM7. VLSI area, cycle delay time and execution cycle number is the index of an efficient design and the final multiplier was designed on these indexes. To verify the operation of embedded multiplier, it was simulated with various audio algorithms.

Multimedia Extension Instructions and Optimal Many-core Processor Architecture Exploration for Portable Ultrasonic Image Processing (휴대용 초음파 영상처리를 위한 멀티미디어 확장 명령어 및 최적의 매니코어 프로세서 구조 탐색)

  • Kang, Sung-Mo;Kim, Jong-Myon
    • Journal of the Korea Society of Computer and Information
    • /
    • v.17 no.8
    • /
    • pp.1-10
    • /
    • 2012
  • This paper proposes design space exploration methodology of many-core processors including multimedia specific instructions to support high-performance and low power ultrasound imaging for portable devices. To explore the impact of multimedia instructions, we compare programs using multimedia instructions and baseline programs with a same many-core processor in terms of execution time, energy efficiency, and area efficiency. Experimental results using a $256{\times}256$ ultrasound image indicate that programs using multimedia instructions achieve 3.16 times of execution time, 8.13 times of energy efficiency, and 3.16 times of area efficiency over the baseline programs, respectively. Likewise, programs using multimedia instructions outperform the baseline programs using a $240{\times}320$ image (2.16 times of execution time, 4.04 times of energy efficiency, 2.16 times of area efficiency) as well as using a $240{\times}400$ image (2.25 times of execution time, 4.34 times of energy efficiency, 2.25 times of area efficiency). In addition, we explore optimal PE architecture of many-core processors including multimedia instructions by varying the number of PEs and memory size.

Improving Haskell GC-Tuning Time Using Divide-and-Conquer (분할 정복법을 이용한 Haskell GC 조정 시간 개선)

  • An, Hyungjun;Kim, Hwamok;Liu, Xiao;Kim, Yeoneo;Byun, Sugwoo;Woo, Gyun
    • KIPS Transactions on Computer and Communication Systems
    • /
    • v.6 no.9
    • /
    • pp.377-384
    • /
    • 2017
  • The performance improvement of a single core processor has reached its limit since the circuit density cannot be increased any longer due to overheating. Therefore, the multicore and manycore architectures have emerged as viable approaches and parallel programming becomes more important. Haskell, a purely functional language, is getting popular in this situation since it naturally supports parallel programming owing to its beneficial features including the implicit parallelism in evaluating expressions and the monadic tools supporting parallel constructs. However, the performance of Haskell parallel programs is strongly influenced by the performance of the run-time system including the garbage collector. Though a memory profiling tool namely GC-tune has been suggested, we need a more systematic way to use this tool. Since GC-tune finds the optimal memory size by executing the target program with all the different possible GC options, the GC-tuning time takes too long. This paper suggests a basic divide-and-conquer method to reduce the number of GC-tune executions by reducing the search area by one-quarter for every searching step. Applying this method to two parallel programs, a maximally independent set and a K-means programs, the memory tuning time is reduced by 7.78 times with accuracy 98% on average.

Implementation of Multicore-Aware Load Balancing on Clusters through Data Distribution in Chapel (클러스터 상에서 다중 코어 인지 부하 균등화를 위한 Chapel 데이터 분산 구현)

  • Gu, Bon-Gen;Carpenter, Patrick;Yu, Weikuan
    • The KIPS Transactions:PartA
    • /
    • v.19A no.3
    • /
    • pp.129-138
    • /
    • 2012
  • In distributed memory architectures like clusters, each node stores a portion of data. How data is distributed across nodes influences the performance of such systems. The data distribution scheme is the strategy to distribute data across nodes and realize parallel data processing. Due to various reasons such as maintenance, scale up, upgrade, etc., the performance of nodes in a cluster can often become non-identical. In such clusters, data distribution without considering performance cannot efficiently distribute data on nodes. In this paper, we propose a new data distribution scheme based on the number of cores in nodes. We use the number of cores as the performance factor. In our data distribution scheme, each node is allocated an amount of data proportional to the number of cores in it. We implement our data distribution scheme using the Chapel language. To show our data distribution is effective in reducing the execution time of parallel applications, we implement Mandelbrot Set and ${\pi}$-Calculation programs with our data distribution scheme, and compare the execution times on a cluster. Based on experimental results on clusters of 8-core and 16-core nodes, we demonstrate that data distribution based on the number of cores can contribute to a reduction in the execution times of parallel programs on clusters.

Real-time Implementation of the G.729 Annex A Using ARM9 $Thumb^{\circledR}$ Processor Core (ARM9 $Thumb^{\circledR}$ 프로세서 코어를 이용한 G.729A의 실시간 구현)

  • 성호상;이동원
    • The Journal of the Acoustical Society of Korea
    • /
    • v.20 no.7
    • /
    • pp.63-68
    • /
    • 2001
  • This paper describes the details of ITU-T SGIS G.729A speech coder implementation using ARM9 Thumb/sup R/ processor core and various techniques used in the optimization process. ITU-T G.729 speech coder is the standard of the toll quality 8 kbit/s speech coding. The input to the speech encoder is assumed to be a 16 bits PCM signal at a sampling rate of 8000 samples per second. G.729A is reduced complexity version of the G.729 coder. This version is bit stream interoperable with the full version. The implemented coder requires 34.8 MIPS for the encoder and 8.1 MIPS for the decoder, 36.5 kBytes of program ROM and 6.3 kBytes of data RAM, respectively. The implemented coder is tested against the set of 9 test vectors provided by ITU-T for bit exact implementation.

  • PDF

Acceleration of Intrusion Detection for Multi-core Video Surveillance Systems (멀티 코어 프로세서 기반의 영상 감시 시스템을 위한 침입 탐지 처리의 가속화)

  • Lee, Gil-Beom;Jung, Sang-Jin;Kim, Tae-Hwan;Lee, Myeong-Jin
    • Journal of the Institute of Electronics and Information Engineers
    • /
    • v.50 no.12
    • /
    • pp.141-149
    • /
    • 2013
  • This paper presents a high-speed intrusion detection process for multi-core video surveillance systems. The high-speed intrusion detection was designed to a parallel process. Based on the analysis of the conventional process, a parallel intrusion detection process was proposed so as to be accelerated by utilizing multiple processing cores in contemporary computing systems. The proposed process performs the intrusion detection in a per-frame parallel manner, considering the data dependency between frames. The proposed process was validated by implementing a multi-threaded intrusion detection program. For the system having eight processing cores, the detection speed of the proposed program is higher than that of the conventional one by up to 353.76% in terms of the frame rate.

Design and Implementation of Multi-View 3D Video Player (다시점 3차원 비디오 재생 시스템 설계 및 구현)

  • Heo, Young-Su;Park, Gwang-Hoon
    • Journal of Broadcast Engineering
    • /
    • v.16 no.2
    • /
    • pp.258-273
    • /
    • 2011
  • This paper designs and implements a multi-view 3D video player system which is operated faster than existing video player systems. The structure for obtaining the near optimum speed in a multi-processor environment by parallelizing the component modules is proposed to process large volumes of multi-view image data at high speed. In order to use the concurrency of bottleneck, we designed image decoding, synthesis and rendering modules in a pipeline structure. For load balancing, the decoder module is divided into the unit of viewpoint, and the image synthesis module is geometrically divided based on synthesized images. As a result of this experiment, multi-view images were correctly synthesized and the 3D sense could be felt when watching the images on the multi-view autostereoscopic display. The proposed application processing structure could be used to process large volumes of multi-view image data at high speed, using the multi-processors to their maximum capacity.

Multi-Dimensional Record Scan with SIMD Vector Instructions (SIMD 벡터 명령어를 이용한 다차원 레코드 스캔)

  • Cho, Sung-Ryong;Han, Hwan-Soo;Lee, Sang-Won
    • Journal of KIISE:Computing Practices and Letters
    • /
    • v.16 no.6
    • /
    • pp.732-736
    • /
    • 2010
  • Processing a large amount of data becomes more important than ever. Particularly, the information queries which require multi-dimensional record scan can be efficiently implemented with SIMD instruction sets. In this article, we present a SIMD record scan technique which employs row-based scanning. Our technique is different from existing SIMD techniques for predicate processes and aggregate operations. Those techniques apply SIMD instructions to the attributes in the same column of the database, exploiting the column-based record organization of the in-memory database systems. Whereas, our SIMD technique is useful for multi-dimensional record scanning. As the sizes of registers and the memory become larger, our row-based SIMD scan can have bigger impact on the performance. Moreover, since our technique is orthogonal to the parallelization techniques for multi-core processors, it can be applied to both uni-processors and multi-core processors without too many changes in the software architectures.

A 8192-point pipelined FFT/IFFT processor using two-step convergent block floating-point scaling technique (2단계 수렴 블록 부동점 스케일링 기법을 이용한 8192점 파이프라인 FFT/IFFT 프로세서)

  • 이승기;양대성;신경욱
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.27 no.10C
    • /
    • pp.963-972
    • /
    • 2002
  • An 8192-point pipelined FFT/IFFT processor core is designed, which can be used in multi-carrier modulation systems such as DUf-based VDSL modem and OFDM-based DVB system. In order to improve the signal-to-quantization-noise ratio (SQNR) of FFT/IFFT results, two-step convergent block floating-point (TS_CBFP) scaling is employed. Since the proposed TS_CBFP scaling does not require additional buffer memory, it reduces memory as much as about 80% when compared with conventional CBFP methods, resulting in area-and power-efficient implementation. The SQNR of about 60-㏈ is achieved with 10-bit input, 14-bit internal data and twiddle factors, and 16-bit output. The core synthesized using 0.25-$\mu\textrm{m}$ CMOS library has about 76,300 gates, 390K bits RAM, and twiddle factor ROM of 39K bits. Simulation results show that it can safely operate up to 50-㎒ clock frequency at 2.5-V supply, resulting that a 8192-point FFT/IFFT can be computed every 164-${\mu}\textrm{s}$. It was verified by Xilinx FPGA implementation.

Construction of a Compiled-code Simulator Generation System for Efficient Design Exploration in Embedded Core Design (임베디드 코어 설계시 효율적인 설계 공간 탐색을 위한 컴파일드 코드 방식 시뮬레이터 생성 시스템 구축)

  • Kim, Sang-Woo;Hwang, Sun-Young
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.36 no.1B
    • /
    • pp.71-79
    • /
    • 2011
  • This paper proposes a compiled-code simulator generation system based-on machine description language for efficient design space exploration in designing an embedded system optimized for a specific application. The proposed system generates a compiled-code simulator which maintains the functional accuracy of an event-driven simulator by determining instruction fetch and decoding processes statically. Generated simulator takes instruction-level and cycle-level simulation for estimating performances in embedded core. To show the efficiency of the constructed compiled-code simulator generator, architecture exploration had been performed for the JPEG encoder application. Starting with MIPS R3000 processor for one embedded core, the proposed system can produce the core showing optimized execution time for the application programming. In this process, a huge amount of simulation time has been used. Cycle-level compiled-code simulator has the functional accuracy and shows performance improvement by 21.7% in terms of simulation speed on the average when compared with an event-driven simulator.