• Title/Summary/Keyword: superscalar processor

Search Result 58, Processing Time 0.034 seconds

A Performance Study of Embedded Multicore Processor Architectures (임베디드 멀티코어 프로세서의 성능 연구)

  • Lee, Jongbok
    • The Journal of the Institute of Internet, Broadcasting and Communication
    • /
    • v.13 no.1
    • /
    • pp.163-169
    • /
    • 2013
  • Recently, the importance of embedded system is growing rapidly. In-order to satisfy the real-time constraints of the system, high performance embedded processor is required. Therefore, as in general purpose computer systems, embedded processor should be designed as multicore architecture as well. Using MiBench benchmarks as input, the trace-driven simulation has been performed and analyzed for the 2-core to 16-core embedded processor architectures with different types of cores from simple RISC to in-order and out-of-order superscalar processors, extensively. As a result, the achievable performance is as high as 23 times over the single core embedded RISC processor.

Performance Study of Multicore Digital Signal Processor Architectures (멀티코어 디지털 신호처리 프로세서의 성능 연구)

  • Lee, Jongbok
    • The Journal of the Institute of Internet, Broadcasting and Communication
    • /
    • v.13 no.4
    • /
    • pp.171-177
    • /
    • 2013
  • Due to the demand for high speed 3D graphic rendering, video file format conversion, compression, encryption and decryption technologies, the importance of digital signal processor system is growing rapidly. In order to satisfy the real-time constraints, high performance digital signal processor is required. Therefore, as in general purpose computer systems, digital signal processor should be designed as multicore architecture as well. Using UTDSP benchmarks as input, the trace-driven simulation has been performed and analyzed for the 2 to 16-core digital signal processor architectures with the cores from simple RISC to in-order and out-of-order superscalar processors for the various window sizes, extensively.

Simulation of YUV-Aware Instructions for High-Performance, Low-Power Embedded Video Processors (고성능, 저전력 임베디드 비디오 프로세서를 위한 YUV 인식 명령어의 시뮬레이션)

  • Kim, Cheol-Hong;Kim, Jong-Myon
    • Journal of KIISE:Computing Practices and Letters
    • /
    • v.13 no.5
    • /
    • pp.252-259
    • /
    • 2007
  • With the rapid development of multimedia applications and wireless communication networks, consumer demand for video-over-wireless capability on mobile computing systems is growing rapidly. In this regard, this paper introduces YUV-aware instructions that enhance the performance and efficiency in the processing of color image and video. Traditional multimedia extensions (e.g., MMX, SSE, VIS, and AltiVec) depend solely on generic subword parallelism whereas the proposed YUV-aware instructions support parallel operations on two-packed 16-bit YUV (6-bit Y, 5-bits U, V) values in a 32-bit datapath architecture, providing greater concurrency and efficiency for color image and video processing. Moreover, the ability to reduce data format size reduces system cost. Experiment results on a representative dynamically scheduled embedded superscalar processor show that YUV-aware instructions achieve an average speedup of 3.9x over the baseline superscalar performance. This is in contrast to MMX (a representative Intel#s multimedia extension), which achieves a speedup of only 2.1x over the same baseline superscalar processor. In addition, YUV-aware instructions outperform MMX instructions in energy reduction (75.8% reduction with YUV-aware instructions, but only 54.8% reduction with MMX instructions over the baseline).

Timing Analysis of Out-of-order Superscalar Processor Programs Using ACSR (ACSR을 이용한 비순차 슈퍼스칼라)

  • 이기흔;최진영
    • Proceedings of the Korean Information Science Society Conference
    • /
    • 1998.10a
    • /
    • pp.697-699
    • /
    • 1998
  • 본 논문은 프로세서 알제브라의 하나인 ACSR을 이용하여 파이프라인 비순차 슈퍼스칼라 프로세서의 타이밍 특성과 자원 제한을 묘사하기 위한 정형기법을 제시한다. ACSR의 두드러진 특징은 시간, 자원, 우선 순위의 개념이 알제브라에서 직접적으로 제공되어 진다는 것이다. 여기서의 접근 방식은 슈퍼스칼라 프로세서의 레지스터를 ACSR 자원으로, 명령어를 ACSR 프로세서로의 모델링하는 것이다. 결과적으로 얻어지는 ACSR식에서 각각의 클럭 주기에서 어떻게 명령어가 실행되고 레지스트들이 이용되는지 확인할 수 있으며 이 모델링을 이용해서 비순차 슈퍼스칼라 프로세서 구조를 검증하거나 분석하는 것이 가능하다.

  • PDF

The Integer Superscalar Processor Performance Model Using Dependency Trees and the Relative ILP (종속 트리와 상대적 병렬도를 이용하는 수퍼스칼라 프로세서의 정수형 성능 예측 모델)

  • 이종복
    • Proceedings of the Korean Information Science Society Conference
    • /
    • 2001.10c
    • /
    • pp.13-15
    • /
    • 2001
  • 최근에 이르러 프로세서의 병렬성을 분석적 기법으로 예측하기 위한 연구가 활발해지면서 프로세서의 성능 예측 모델에 대한중요성이 대두되고 있다. 그러나 기존의 연구는 현재 광범위하게 사용되고 있는 다중 분기 예측법을 이용하는 프로세서에 대하여 분기 차수와 관계없는 재귀적 성능 모델을 제공해주지 않는다. 본 논문에서는 이것을 해결하기 위하여, 매 싸이클마다 명령어 종속 트리를 구성하고 종속인 명령어 간에 상대적인 병렬도 갓을 부여하여 성능 예측 모델 입력 데이타를 측정하였다. 그 곁과, 다중 분기 예측법을 사용하는 프로세서에서 정수형 프로그램에 대한 성능을 기존의 성능모델보다 작은 상대 오차로 예측할 수 있다.

  • PDF

A Processor Architecture for 802.11 Wireless LAN Environment (802.11 Wireless LAN 환경에 적합한 프로세서 구조)

  • 전성재;홍인표;이용주;이용석;정진우
    • Proceedings of the Korean Information Science Society Conference
    • /
    • 2004.10a
    • /
    • pp.550-552
    • /
    • 2004
  • 최근 휴대폰, PDA, 노트북 등의 모바일 제품의 인기에 따라 모바일에 대한 소비자의 관심이 증대되고 있으며, 대형 네트워크 장비보다 소형의 개인 휴대용의 모바일 제품의 성장세가 두드러지고 있다. 이러한 추세에 따라 무선랜에 대한 관심도 증대되고 있다. 본 논문에서는 기존의 ARM 프로세서를 기반으로 802..11 무선랜 환경에 맞는 네트워크 프로세서 구조에 대한 연구를 수행하였다. 그 결과 전송과 수신이 빈번하게 동시에 일어나는 무선랜 환경에서는 multi-threading을 처리할 수 있는 프로세서가 구조(SMT)가 Superscalar 구조에 비해 높은 성능 향상 폭을 보여주었다

  • PDF

A Study of Trace-driven Simulation for Multi-core Processor Architectures (멀티코어 프로세서의 명령어 자취형 모의실험에 대한 연구)

  • Lee, Jong-Bok
    • The Journal of the Institute of Internet, Broadcasting and Communication
    • /
    • v.12 no.3
    • /
    • pp.9-13
    • /
    • 2012
  • In order to overcome the complexity and power problems of superscalar processors, the multi-core architecture has been prevalent recently. Although the execution-driven simulation is wide spread, the trace-driven simulation has speed advantages over the execution-driven simulation. We present a methodology to simulate multi-core architecture using trace-driven simulator. Using SPEC 2000 benchmarks as input, the trace-driven simulation has been performed for the cores ranging from 2 to 16 extensively. As a result, the 16-core processor resulted in 4.1 IPC and 13.3 times speed up over single-core processor on the average.

Study of an In-order SMT Architecture and Grouping Schemes

  • Moon, Byung-In;Kim, Moon-Gyung;Hong, In-Pyo;Kim, Ki-Chang;Lee, Yong-Surk
    • International Journal of Control, Automation, and Systems
    • /
    • v.1 no.3
    • /
    • pp.339-350
    • /
    • 2003
  • In this paper, we propose a simultaneous multithreading (SMT) architecture that improves instruction throughput by exploiting instruction level parallelism (ILP) and thread level parallelism (TLP). The proposed architecture issues and completes instructions belonging to the same thread in exact program order. The issue and completion policy greatly reduces the design complexity and hardware cost of our architecture, compared with others that employ out-of-order issue and completion. On the other hand, when the instructions belong to different threads, the issue and completion orders for those instructions may not necessarily be identical to the fetch order. The processor issues instructions simultaneously from multiple threads to functional units by exploiting ILP and TLP, and by dynamic resource sharing. That parallel execution notably improves performance and resource utilization with minimal additional hardware cost over the conventional superscalar processors. This paper proposes an SMT architecture with grouping as well as one without grouping. Without grouping, all threads dynamically and flexibly share most resources. On the other hand, in the SMT architecture with grouping, in which resources and threads are divided into several groups for design simplification, resources are shared only among threads belonging to the same group as those resources. Simulation results show that our processors with four and eight threads improve performance by three or more times over the conventional superscalar processor with comparable execution resources and policies, and that reasonable grouping reduces the design complexity of SMT processors with little negative effect on performance.

Design of a Graphic Processor for Multimedia Data Processing (멀티미디어 데이타 처리를 위한 그래픽 프로세서 설계)

  • 고익상;한우종;선우명동
    • Journal of the Korean Institute of Telematics and Electronics C
    • /
    • v.36C no.10
    • /
    • pp.56-65
    • /
    • 1999
  • This paper presents an architecture and its instruction set for a graphic coprocessor(GCP) which can be used for a multimedia server. The proposed instruction set employs parallel architecture concepts, such as SIMD and Superscalar. GCP consists of a scheduler and four functional units. The scheduler solves an instruction bottleneck problem causing by sharing with four general processors(GPs). GCP can execute up to 4 instructions in parallel. It consists of about 56,000 gates and operates at 30 MHz clock frequency due to speed limitation of SOG technology. GCP meets the real-time DCT algorithm requirement of the CIF image format and can process up to 63 frames/sec for the DCT Algorithm and 21 frames/sec for the Full Block matching Algorithm of the CIF image format.

  • PDF

An Optimal Instruction Fetch Strategy for SMT Processors (SMT 프로세서에 최적화된 명령어 페치 전략에 관한 연구)

  • 홍인표;문병인;김문경;이용석
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.27 no.5C
    • /
    • pp.512-521
    • /
    • 2002
  • Recently, conventional superscalar RISC processors arrive their performance limit, and many researches on the next-generation architecture are concentrated on SMT(Simultaneous Multi-Threading). In SMT processors, multiple threads are executed simultaneously and share hardware resources dynamically. In this case, it is more important to supply instructions from multiple threads to processor core efficiently than ever. Because SMT architecture shows higher IPC(Instructions per cycle) than superscalar architecture, performance is influenced by fetch bandwidth and the size of fetch queue. Moreover, to use TLP(Thread Level Parallelism) efficiently, fetch thread selection algorithm and fetch bandwidth for each selected threads must be carefully designed. Thus, in this paper, the performance values influenced by these factors are analyzed. Based on the results, an optimal instruction fetch strategy for SMT processors is proposed.