• Title/Summary/Keyword: data level parallelism

Search Result 47, Processing Time 0.059 seconds

Static forwardin: an approach to reduce data hazards in VLIW processor (정적 포워딩에 의한 VLIW 프로세서의 데이터 hazard 처리)

  • 박형준;김이섭
    • Journal of the Korean Institute of Telematics and Electronics C
    • /
    • v.35C no.2
    • /
    • pp.1-9
    • /
    • 1998
  • To achieve high performance in VLIW processors, they must exploit the parallelism on application programs. Data dependency makes it difficult to find the instruction-level parallelism. Among the three kinds of data dependency, true dependency causes RAW(Read After Wirte) hazards that occur most frequently in VILW processors. Forwarding is a widely used technique to reduce the performance degradation caused by RAW hazards. However, forwarding requires too much area of the chip when it is applied to VLIW processors. In this paper, static forwarding is proposed to reduce the hardware cost of forwarding circuits. It needs an extended compiler to detect RAW hazards and control the proposed forwarding scheme via instruction. And it uses the modified register file to shrink the area of forwarding path. VLIW Processor Model is also designed to verify static forwarding. This paper describes the operation of static forwarding and the comparison with the conventional forwarding.

  • PDF

Latency Hiding based Warp Scheduling Policy for High Performance GPUs

  • Kim, Gwang Bok;Kim, Jong Myon;Kim, Cheol Hong
    • Journal of the Korea Society of Computer and Information
    • /
    • v.24 no.4
    • /
    • pp.1-9
    • /
    • 2019
  • LRR(Loose Round Robin) warp scheduling policy for GPU architecture results in high warp-level parallelism and balanced loads across multiple warps. However, traditional LRR policy makes multiple warps execute long latency operations at the same time. In cases that no more warps to be issued under long latency, the throughput of GPUs may be degraded significantly. In this paper, we propose a new warp scheduling policy which utilizes latency hiding, leading to more utilized memory resources in high performance GPUs. The proposed warp scheduler prioritizes memory instruction based on GTO(Greedy Then Oldest) policy in order to provide reduced memory stalls. When no warps can execute memory instruction any more, the warp scheduler selects a warp for computation instruction by round robin manner. Furthermore, our proposed technique achieves high performance by using additional information about recently committed warps. According to our experimental results, our proposed technique improves GPU performance by 12.7% and 5.6% over LRR and GTO on average, respectively.

Computation-Communication Overlapping in AES-CCM Using Thread-Level Parallelism on a Multi-Core Processor (멀티코어 프로세서의 쓰레드-수준 병렬성을 활용한 AES-CCM 계산-통신 중첩화)

  • Lee, Eun-Ji;Lee, Sung-Ju;Chung, Yong-Wha;Lee, Myung-Ho;Min, Byoung-Ki
    • Journal of KIISE:Computing Practices and Letters
    • /
    • v.16 no.8
    • /
    • pp.863-867
    • /
    • 2010
  • Multi-core processors are becoming increasingly popular. As they are widely adopted in embedded systems as well as desktop PC's, many multimedia applications are being parallelized on multi-core platforms. However, it is difficult to parallelize applications with inherent data dependencies such as encryption algorithms for multimedia data. In order to overcome this limit, we propose a technique to overlap computation and communication using an otherwise idle core in this paper. In particular, we interpret the problem of multimedia computation and communication as a pipeline design problem at the application program level, and derive an optimal number of stages in the pipeline.

Data Level Parallelism for H.264/AVC Decoder on a Multi-Core Processor and Performance Analysis (멀티코어 프로세서에서의 H.264/AVC 디코더를 위한 데이터 레벨 병렬화 성능 예측 및 분석)

  • Cho, Han-Wook;Jo, Song-Hyun;Song, Yong-Ho
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.46 no.8
    • /
    • pp.102-116
    • /
    • 2009
  • There have been lots of researches for H.264/AVC performance enhancement on a multi-core processor. The enhancement has been performed through parallelization methods. Parallelization methods can be classified into a task-level parallelization method and a data level parallelization method. A task-level parallelization method for H.264/AVC decoder is implemented by dividing H.264/AVC decoder algorithms into pipeline stages. However, it is not suitable for complex and large bitstreams due to poor load-balancing. Considering load-balancing and performance scalability, we propose a horizontal data level parallelization method for H.264/AVC decoder in such a way that threads are assigned to macroblock lines. We develop a mathematical performance expectation model for the proposed parallelization methods. For evaluation of the mathematical performance expectation, we measured the performance with JM 13.2 reference software on ARM11 MPCore Evaluation Board. The cycle-accurate measurement with SoCDesigner Co-verification Environment showed that expected performance and performance scalability of the proposed parallelization method was accurate in relatively high level

Design of a Hybrid Data Value Predictor with Dynamic Classification Capability in Superscalar Processors (슈퍼스칼라 프로세서에서 동적 분류 능력을 갖는 혼합형 데이타 값 예측기의 설계)

  • Park, Hee-Ryong;Lee, Sang-Jeong
    • Journal of KIISE:Computer Systems and Theory
    • /
    • v.27 no.8
    • /
    • pp.741-751
    • /
    • 2000
  • To achieve high performance by exploiting instruction level parallelism aggressively in superscalar processors, it is necessary to overcome the limitation imposed by control dependences and data dependences which prevent instructions from executing parallel. Value prediction is a technique that breaks data dependences by predicting the outcome of an instruction and executes speculatively its data dependent instruction based on the predicted outcome. In this paper, a hybrid value prediction scheme with dynamic classification mechanism is proposed. We design a hybrid predictor by combining the last predictor, a stride predictor and a two-level predictor. The choice of a predictor for each instruction is determined by a dynamic classification mechanism. This makes each predictor utilized more efficiently than the hybrid predictor without dynamic classification mechanism. To show performance improvements of our scheme, we simulate the SPECint95 benchmark set by using execution-driven simulator. The results show that our scheme effect reduce of 45% hardware cost and 16% prediction accuracy improvements comparing with the conventional hybrid prediction scheme and two-level value prediction scheme.

  • PDF

Efficient of The Data Value Predictor in Superscalar Processors (슈퍼스칼라 프로세서에서 데이터 값 예측기의 성능효과)

  • 박희룡;전병찬;이상정
    • Proceedings of the IEEK Conference
    • /
    • 2000.06c
    • /
    • pp.55-58
    • /
    • 2000
  • To achieve high performance by exploiting instruction level parallelism(ILP) aggressively in superscalar processors, value prediction is used. Value prediction is a technique that breaks data dependences by predicting the outcome of an instruction and executes speculatively it's data dependent instruction based on the predicted outcome. In this paper, the performance of a hybrid value prediction scheme with dynamic classification mechanism is measured and analyzed by using execution-driven simulator for SPECint95 benchmark set.

  • PDF

Adaptive Load Balancing Scheme using a Combination of Hierarchical Data Structures and 3D Clustering for Parallel Volume Rendering on GPU Clusters (계층 자료구조의 결합과 3차원 클러스터링을 이용하여 적응적으로 부하 균형된 GPU-클러스터 기반 병렬 볼륨 렌더링)

  • Lee Won-Jong;Park Woo-Chan;Han Tack-Don
    • Journal of KIISE:Computer Systems and Theory
    • /
    • v.33 no.1_2
    • /
    • pp.1-14
    • /
    • 2006
  • Sort-last parallel rendering using a cluster of GPUs has been widely used as an efficient method for visualizing large- scale volume datasets. The performance of this method is constrained by load balancing when data parallelism is included. In previous works static partitioning could lead to self-balance when only task level parallelism is included. In this paper, we present a load balancing scheme that adapts to the characteristic of volume dataset when data parallelism is also employed. We effectively combine the hierarchical data structures (octree and BSP tree) in order to skip empty regions and distribute workload to corresponding rendering nodes. Moreover, we also exploit a 3D clustering method to determine visibility order and save the AGP bandwidths on each rendering node. Experimental results show that our scheme can achieve significant performance gains compared with traditional static load distribution schemes.

A Hybrid Value Predictor using Speculative Update in Superscalar Processors (슈퍼스칼라 프로세서에서 모험적 갱신을 사용한 하이브리드 결과값 예측기)

  • Park, Hong-Jun;Sin, Yeong-Ho;Jo, Yeong-Il
    • Journal of KIISE:Computer Systems and Theory
    • /
    • v.28 no.11
    • /
    • pp.592-600
    • /
    • 2001
  • To improve the performance of wide-issue Superscalar microprocessors, it is essential to increase the width of instruction fetch and issue rate. Data dependences are major hurdle to exploit ILP(Instruction-Level Parallelism) efficiently, so several related works have suggested that the limits imposed by data dependences can be overcome to some extent with the use of the data value prediction. But the suggested mechanisms may access the same value prediction table entry again before they have been updated with a real data value. They will cause incorrect value prediction by using stable data and incur misprediction penalty and lowering performance. In this paper, we propose a new hybrid value predictor which achieve high performance by reducing stale data. Because the proposed hybrid value predictor can update the prediction table speculatively, it efficiently reduces the number of mispredicted instruction due to stable due to stale data. For SPECint95 benchmark programs on the 16-issue superscalar processors, simulation results show that the average prediction accuracy increase from 59% for non-speculative update to 72% for speculative update.

  • PDF

The Performance evaluation of Data Value Predictor in ILP Processor (ILP 프로세서에서 데이터 값 예측기의 성능 평가)

  • 박희룡;전병찬;이상정
    • Proceedings of the Korean Information Science Society Conference
    • /
    • 1998.10a
    • /
    • pp.21-23
    • /
    • 1998
  • 본 논문에서 ILP (Instruction Level Parallelism)의 성능향상을 위하여 데이터 값들을 미리 예측하여 병렬로 이슈(issue)하고 수행하는 기존의 데이터 값 예측기(data value predictor)를 비교 분석하여 각 예측기의 예측율을 측정하고, 2-단계 데이터 값 예측기(Two-Level Data Value Predictor)와 혼합형 데이터 값 예측기(Hydrid Data Value Predictor)에서 발생되는 aiasing 을 측정하기 위해 수정된 데이터 값 예측기를 사용하여 측정한 결과 aliasing은 50% 감소하였지만 예측율에는 영향을 미치지 못함과 데이터 값 예측기의 예측율을 측정한 결과 혼합형 데이터 값 예측기의 예측율이 2-단계 데이터 값 예측기와 스트라이드 데이터 값 예측기(Stride Data Value Predictor)에서 평균 5.7%, 최근 값 예측기(Last Data Value Predictor)보다는 평균 38%의 예측 정확도가 높음을 입증하였다.

  • PDF

Exploiting Thread-Level Parallelism in Lockstep Execution by Partially Duplicating a Single Pipeline

  • Oh, Jaeg-Eun;Hwang, Seok-Joong;Nguyen, Huong Giang;Kim, A-Reum;Kim, Seon-Wook;Kim, Chul-Woo;Kim, Jong-Kook
    • ETRI Journal
    • /
    • v.30 no.4
    • /
    • pp.576-586
    • /
    • 2008
  • In most parallel loops of embedded applications, every iteration executes the exact same sequence of instructions while manipulating different data. This fact motivates a new compiler-hardware orchestrated execution framework in which all parallel threads share one fetch unit and one decode unit but have their own execution, memory, and write-back units. This resource sharing enables parallel threads to execute in lockstep with minimal hardware extension and compiler support. Our proposed architecture, called multithreaded lockstep execution processor (MLEP), is a compromise between the single-instruction multiple-data (SIMD) and symmetric multithreading/chip multiprocessor (SMT/CMP) solutions. The proposed approach is more favorable than a typical SIMD execution in terms of degree of parallelism, range of applicability, and code generation, and can save more power and chip area than the SMT/CMP approach without significant performance degradation. For the architecture verification, we extend a commercial 32-bit embedded core AE32000C and synthesize it on Xilinx FPGA. Compared to the original architecture, our approach is 13.5% faster with a 2-way MLEP and 33.7% faster with a 4-way MLEP in EEMBC benchmarks which are automatically parallelized by the Intel compiler.

  • PDF