Search | Korea Science

A Prefetch Architecture with Efficient Branch Prediction for a 64-bit 4-way Superscalar Microprocessor (64비트 4-way 수퍼스칼라 마이크로프로세서의 효율적인 분기 예측을 수행하는 프리페치 구조)

문상국;문병인;이용환;이용석
- The Journal of Korean Institute of Communications and Information Sciences
- /
- v.25 no.11B
- /
- pp.1939-1947
- /
- 2000
본 논문에서는 명령어의 효율적인 페치를 위해 분기 타겟 주소 전체를 사용하지 않고 캐쉬 메모리(cache memory) 내의 적은 비트 수로 인덱싱 하여 한 클럭 사이클 안에 최대 4개의 명령어를 다음 파이프라인으로 보내줄 수 있는 방법을 제시한다. 본 프리페치 유닛은 크게 나누어 3개의 영역으로 나눌 수 있는데, 분기에 관련하여 미리 부분적으로 명령어를 디코드 하는 프리디코드(predecode) 블록, 타겟 주소(NTA : Next Target Address) 테이블 영역을 추가시킨 명령어 캐쉬(instruction cache) 블록, 전체 유닛을 제어하고 가상 주소를 관리하는 프리페치(prefetch) 블록으로 나누어진다. 사용된 명령어들은 SPARC(Scalable Processor ARChitecture) V9에 기준 하였고 구현은 Verilog-HDL(Hardwave Description Language)을 사용하여 기능 수준으로 기술되고 검증되었다. 구현된 프리페치 유닛은 명령어 흐름에 분기가 존재하더라도 단일 사이클 안에 4개까지의 명령어들을 정확한 예측 하에 다음 파이프라인으로 보내줄 수 있다. 또한 NTA를 사용한 방법은 같은 수의 레지스터 비트를 사용하였을 때 BTB(Branch Target Buffer)를 사용하는 방법과 비교하여 2배정도 많은 개수의 분기 명령 주소를 저장할 수 있는 장점이 있다.
PDF

A Wide-Window Superscalar Microprocessor Profiling Performance Model Using Multiple Branch Prediction (대형 윈도우에서 다중 분기 예측법을 이용하는 수퍼스칼라 프로세서의 프로화일링 성능 모델)

Lee, Jong-Bok
- The Transactions of The Korean Institute of Electrical Engineers
- /
- v.58 no.7
- /
- pp.1443-1449
- /
- 2009
This paper presents a profiling model of a wide-window superscalar microprocessor using multiple branch prediction. The key idea is to apply statistical profiling technique to the superscalar microprocessor with a wide instruction window and a multiple branch predictor. The statistical profiling data are used to obtain a synthetical instruction trace, and the consecutive multiple branch prediction rates are utilized for running trace-driven simulation on the synthesized instruction trace. We describe our design and evaluate it with the SPEC 2000 integer benchmarks. Our performance model can achieve accuracy of 8.5 % on the average.
PDF KSCI

A design of floating-point multiplier for superscalar microprocessor (수퍼스칼라 마이크로프로세서용 부동 소수점 승산기의 설계)

최병윤;이문기
- The Journal of Korean Institute of Communications and Information Sciences
- /
- v.21 no.5
- /
- pp.1332-1344
- /
- 1996
This paper presents a pipelined floating point multiplier(FMUL) for superscalar microprocessors that conbines radix-16 recoding scheme based on signed-digit(SD) number system and new rouding and normalization scheme. The new rounding and normalization scheme enable the FMUL to compute sticky bit in parallel with multiple operation and elminate timing delay due to post-normalization. By expoliting SD radix-16 recoding scheme, we can achieves further reduction of silicon area and computation time. The FMUL can execute signle-precision or double-precision floating-point multiply operation through three-stage pipelined datapath and support IEEE standard 754. The algorithm andstructure of the designed multiplier have been successfully verified through Verilog HOL modeling and simulation.
PDF

A design of floating-point arithmetic unit for superscalar microprocessor (수퍼스칼라 마이크로프로세서용 부동 소수점 연산회로의 설계)

최병윤;손승일;이문기
- The Journal of Korean Institute of Communications and Information Sciences
- /
- v.21 no.5
- /
- pp.1345-1359
- /
- 1996
This paper presents a floating point arithmetic unit (FPAU) for supescalar microprocessor that executes fifteen operations such as addition, subtraction, data format converting, and compare operation using two pipelined arithmetic paths and new rounding and normalization scheme. By using two pipelined arithmetic paths, each aritchmetic operation can be assigned into appropriate arithmetic path which high speed operation is possible. The proposed normalization an rouding scheme enables the FPAU to execute roundig operation in parallel with normalization and to reduce timing delay of post-normalization. And by predicting leading one position of results using input operands, leading one detection(LOD) operation to normalize results in the conventional arithmetic unit can be eliminated. Because the FPAU can execuate fifteen single-precision or double-precision floating-point arithmetic operations through three-stage pipelined datapath and support IEEE standard 754, it has appropriate structure which can be ingegrated into superscalar microprocessor.
PDF

The Analytic Performance Model of the Superscalar Processor Using Multiple Branch Prediction (독립시행의 정리를 이용하는 수퍼스칼라 프로세서의 다중 분기 예측 성능 모델)

이종복
- Proceedings of the IEEK Conference
- /
- 1999.06a
- /
- pp.1009-1012
- /
- 1999
An analytical performance model that can predict the performance of a superscalar processor employing multiple branch prediction is introduced. The model is based on the conditional independence probability and the basic block size of instructions, with the degree of multiple branch prediction, the fetch rate, and the window size of a superscalar architecture. Trace driven simulation is performed for the subset of SPEC integer benchmarks, and the measured IPCs are compared with the results derived from the model. As the result, our analytic model could predict the performance of the superscalar processor using multiple branch prediction within 6.6 percent on the average.
PDF

A Theoretical Superscalar Microprocessor Performance Model with Limited Functional Units Using Instruction Dependencies (한정된 연산유닛에서 명령어 종속성을 이용하는 수퍼스칼라 프로세서의 이론적 성능 모델)

Lee, Jong-Bok
- The Transactions of The Korean Institute of Electrical Engineers
- /
- v.59 no.2
- /
- pp.423-428
- /
- 2010
In the initial design phase of superscalar microprocessors, a performance model is necessary. A theoretic performance model is very useful since performance for various architecture parameters can be obtained by simply computing equations, without repeating simulations, Previous studies established theoretic performance models using the relation between the instruction window size and the issue width, with the penalties due to branch mispredictions and cache misses. However, the study was intended for unlimited number of functional units, which is insufficient for the real case application. This paper proposes a superscalar microprocessor theoretical performance model which also works for the limited functional units. To enhance the accuracy of our limited functional unit model, instruction dependency rates are employed. By using trace-driven data of SPEC 2000 integer programs as input, this paper shows that the theoretically computed performance of superscalar microprocessor with limited number of functional units is quite similar to the measured performance.
https://doi.org/10.5370/KIEE.2010.59.2.423 인용 PDF KSCI

Instruction Queue Architecture for Low Power Microprocessors (마이크로프로세서 전력소모 절감을 위한 명령어 큐 구조)

Choi, Min;Maeng, Seung-Ryoul
- Journal of the Institute of Electronics Engineers of Korea SD
- /
- v.45 no.11
- /
- pp.56-62
- /
- 2008
Modern microprocessors must deliver high application performance, while the design process should not subordinate power. In terms of performance and power tradeoff, the instructions window is particularly important. This is because a large instruction window leads to achieve high performance. However, naive scaling conventional instruction window can severely affect the complexity and power consumption. This paper explores an architecture level approach to reduce power dissipation. We propose a low power issue logic with an efficient tag translation. The direct lookup table (DTL) issue logic eliminates the associative wake-up of conventional instruction window. The tag translation scheme deals with data dependencies and resource conflicts by using bit-vector based structure. Experimental results show that, for SPEC2000 benchmarks, the proposed design reduces power consumption by 24.45% on average over conventional approach.
PDF KSCI

A Performance Study of Multi-core Out-of-Order Superscalar Processor Architecture (멀티코어 비순차 수퍼스칼라 프로세서의 성능 연구)

Lee, Jong-Bok
- The Transactions of The Korean Institute of Electrical Engineers
- /
- v.61 no.10
- /
- pp.1502-1507
- /
- 2012
In order to overcome the hardware complexity and power consumption problems, recently the multi-core architecture has been prevalent. For hardware simplicity, usually RISC processor is adopted as the unit core processor. However, if the performance of unit core processor is enhanced, the overall performance of the multi-core processor architecture can be further increased. In this paper, out-of-order superscalar processor is utilized for the multi-core processor architecture. Using SPEC 2000 benchmarks as input, the trace-driven simulation has been performed for the out-of-order superscalar cores between 2 and 16 extensively. As a result, the 16-core out-of-order superscalar processor for the window size of 16 resulted in 17.4 times speed up over the single-core out-of-order superscalar processor, and 50 times speed up over the single core RISC processor. When compared for the same number of cores on the average, the multi-core out-of-order superscalar processor performance achieved 3.2 times speed up over the multi-core RISC processor and 1.6 times speed up over the multi-core in-order superscalar processor.
https://doi.org/10.5370/KIEE.2012.61.10.1502 인용 PDF KSCI

Optimistic Colescing Technique for Copy Elimination in ILP Instruction Scheduling (ILP 명령 스케쥴링에서의 복사 제거를 위한 낙관적 융합 기법)

Park, Jin-Pyo;Mun, Su-Muk
- Journal of KIISE:Software and Applications
- /
- v.26 no.5
- /
- pp.692-701
- /
- 1999
수퍼스칼라(superscalar)나 VLIW 와 같은 명령어 수준 병렬화(ILP) 프로세서의 성능을 극대화하는 과감한 명령어 스케쥴링은 소프트웨어 파이프라이닝과같은 스케쥴링 과정을 거치면서 일반적인 복사 명령어 제거 기법으로 없앨 수 없는 서로 간섭하는 복사 명령을 많이 만들어내는데 루프 내부에 생성된 이러한 복사명령은 적절한 루프 펼침을 수행하여 간섭관계를 없앰으로서 제거할 수 있다. 본 논문에서는 이와 같이 루프 펼침이 수행된 루프 내부의 복사명령을 제거하는 기법으로 그래프 컬러링 상에 구현한 낙관적 융합기법을 제안한다. 그래프 컬러링에서의 융합기법은 간선의 개수가 많은 노드를 만들어 낼수 있으므로 채색성에 부정적인 영향을 주는 것으로 알려져 왔으나 본 기법에서는 융합되는 노드에 동시에 간섭하는 노드의 간선의 수가 줄어드는 긍정적인 영향을 최대한 이용하여 채색성을 높이고 융합된 노드에 대한 실제 버림(spill)이 일어나는 경우 유효 범위 분절(live range splitting)을 통하여 버림의 부담을 최대한 줄이도록 하였으며 이를 VLIW 스케쥴링 된 SPEC 정수벤치마크 루프내부의 복사 명령 제거에 적용한 결과 제거 가능한 복사 명령의 99%를 제거하면서도 버림명령은 다른 융합 기법과 비교하여 가장 적게 발생하는 우수한 결과를 얻을수 있었다.

Analytical Models and their Performance Analysis of Superscalar Processors (수퍼스칼라 프로세서의 해석적 모델 및 성능 분석)

Kim, Hak-Jun;Kim, Seon-Mo;Choe, Sang-Bang
- Journal of KIISE:Computer Systems and Theory
- /
- v.26 no.7
- /
- pp.847-862
- /
- 1999
본 논문에서는 유한버퍼의(finite-buffered) 동기화된(synchronous) 큐잉모델(queueing model)을 이용하여 명령어들간의 병렬성, 분기명령의 빈도수, 분기예측(branch prediction)의 정확도, 캐쉬미스 등의 파라미터들을 고려하여 프로세서의 명령어 실행율을 예측하며 캐쉬의 성능과 파이프라인 성능간의 관계를 분석할 수 있는 새로운 해석적 모델을 제안하였다. 해석적 모델은 모델의 타당성을 검증하기 위해서 시뮬레이션을 수행하여 얻은 결과와 비교하였다. 해석적 모델과 시뮬레이션을 비교한 결과 대부분 10% 오차 내에서 일치하였다. 본 연구를 통하여 얻은 해석적 모델을 사용하면 시뮬레이션에서는 드러나지 않는 성능제약의 원인에 대한 명확한 규명이 가능하기 때문에 성능향상을 위한 설계자료를 얻을 수 있으며, 시스템 성능 밸런스를 위한 캐쉬와 비순차이슈 파이프라인 성능간의 관계에 대한 정확한 분석이 가능하다.Abstract This research presents a novel analytic model to predict the instruction execution rate of superscalar processors using the queuing model with finite-buffer size and synchronous operation mode. The proposed model is also able to analyze the performance relationship between cache and pipeline. The proposed model takes into account various kinds of architectural parameters such as instruction-level parallelism, branch probability, the accuracy of branch prediction, cache miss, and etc.. To prove the correctness of the model, we performed extensive simulations and compared the results with the analytic model. Simulation results showed that the proposed model can estimate the average execution rate accurately within 10% error compared to simulation results. The proposed model can explain the causes of performance bottleneck which cannot be uncovered by the simulation method only. The model is also able to show the effect of the cache miss on the performance of out-of-order issue superscalar processors, which can provide an valuable information in designing a balanced system.

Search Result 41, Processing Time 0.022 seconds

이메일무단수집거부

이용약관

제 1 장 총칙

제 2 장 이용계약의 체결

제 3 장 계약 당사자의 의무

제 4 장 서비스의 이용

제 5 장 계약 해지 및 이용 제한

제 6 장 손해배상 및 기타사항

Detail Search

Image Search (β)