• Title/Summary/Keyword: data level parallelism

Search Result 47, Processing Time 0.028 seconds

A Study on Effect of Code Distribution and Data Replication for Multicore Computing Architectures

  • Cho, Doosan
    • International Journal of Advanced Culture Technology
    • /
    • v.9 no.4
    • /
    • pp.282-287
    • /
    • 2021
  • A multicore system must be able to take full advantage of the program's instruction and data parallelism. This study introduces the data replication technique as a support technique to maximize the program's instruction and data parallelism. Instruction level parallelism can be limited by data dependency. In this case, if data is replicated to each processor core and used, instruction level parallelism can be used to the maximum. The technique proposed in this study can maximize the performance improvement effect when applied to scientific applications such as matrix multiplication operation.

Limits on the efficiency of event-based algorithms for Monte Carlo neutron transport

  • Romano, Paul K.;Siegel, Andrew R.
    • Nuclear Engineering and Technology
    • /
    • v.49 no.6
    • /
    • pp.1165-1171
    • /
    • 2017
  • The traditional form of parallelism in Monte Carlo particle transport simulations, wherein each individual particle history is considered a unit of work, does not lend itself well to data-level parallelism. Event-based algorithms, which were originally used for simulations on vector processors, may offer a path toward better utilizing data-level parallelism in modern computer architectures. In this study, a simple model is developed for estimating the efficiency of the event-based particle transport algorithm under two sets of assumptions. Data collected from simulations of four reactor problems using OpenMC was then used in conjunction with the models to calculate the speedup due to vectorization as a function of the size of the particle bank and the vector width. When each event type is assumed to have constant execution time, the achievable speedup is directly related to the particle bank size. We observed that the bank size generally needs to be at least 20 times greater than vector size to achieve vector efficiency greater than 90%. When the execution times for events are allowed to vary, the vector speedup is also limited by differences in the execution time for events being carried out in a single event-iteration.

Design of Parallel Processing of Lane Detection System Based on Multi-core Processor (멀티코어를 이용한 차선 검출 병렬화 시스템 설계)

  • Lee, Hyo-Chan;Moon, Dai-Tchul;Park, In-hag;Heo, Kang
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.20 no.9
    • /
    • pp.1778-1784
    • /
    • 2016
  • we improved the performance by parallelizing lane detection algorithms. Lane detection, as a intellectual assisting system, helps drivers make an alarm sound or revise the handle in response of lane departure. Four kinds of algorithms are implemented in order as following, Gaussian filtering algorithm so as to remove the interferences, gray conversion algorithm to simplify images, sobel edge detection algorithm to find out the regions of lanes, and hough transform algorithm to detect straight lines. Among parallelized methods, the data level parallelism algorithm is easy to design, yet still problem with the bottleneck. The high-speed data level parallelism is suggested to reduce this bottleneck, which resulted in noticeable performance improvement. In the result of applying actual road video of black-box on our parallel algorithm, the measurement, in the case of single-core, is approximately 30 Frames/sec. Furthermore, in the case of octa-core parallelism, the data level performance is approximately 100 Frames/sec and the highest performance comes close to 150 Frames/sec.

A Vectorization Technique at Object Code Level (목적 코드 레벨에서의 벡터화 기법)

  • Lee, Dong-Ho;Kim, Ki-Chang
    • The Transactions of the Korea Information Processing Society
    • /
    • v.5 no.5
    • /
    • pp.1172-1184
    • /
    • 1998
  • ILP(Instruction Level Parallelism) processors use code reordering algorithms to expose parallelism in a given sequential program. When applied to a loop, this algorithm produces a software-pipelined loop. In a software-pipelined loop, each iteration contains a sequence of parallel instructions that are composed of data-independent instructions collected across from several iterations. For vector loops, however the software pipelining technique can not expose the maximum parallelism because it schedules the program based only on data-dependencies. This paper proposes to schedule differently for vector loops. We develop an algorithm to detect vector loops at object code level and suggest a new vector scheduling algorithm for them. Our vector scheduling improves the performance because it can schedule not only based on data-dependencies but on loop structure or iteration conditions at the object code level. We compare the resulting schedules with those by software-pipelining techniques in the aspect of performance.

  • PDF

Speculative Parallelism Characterization Profiling in General Purpose Computing Applications

  • Wang, Yaobin;An, Hong;Liu, Zhiqin;Li, Li;Yu, Liang;Zhen, Yilu
    • Journal of Computing Science and Engineering
    • /
    • v.9 no.1
    • /
    • pp.20-28
    • /
    • 2015
  • General purpose computing applications have not yet been thoroughly explored in procedure level speculation, especially in the light-weighted profiling way. This paper proposes a light-weighted profiling mechanism to analyze speculative parallelism characterization in several classic general purpose computing applications from SPEC CPU2000 benchmark. By comparing the key performance factors in loop and procedure-level speculation, it includes new findings on the behaviors of loop and procedure-level parallelism under these applications. The experimental results are as follows. The best gzip application can only achieve a 2.4X speedup in loop level speculation, while the best mcf application can achieve almost 3.5X speedup in procedure level. It proves that our light-weighted profiling method is also effective. It is found that between the loop-level and procedure-level TLS, the latter is better on several cases, which is against the conventional perception. It is especially shown in the applications where their 'hot' procedure body is concluded as 'hot' loops.

Exploiting Implicit Parallelism for Single Loops in Java Programming Language (JAVA 프로그래밍 언어에서 단일루프구조의 무시적 병렬성 검출)

  • Kwon, Oh-Jin
    • Journal of Information Management
    • /
    • v.29 no.3
    • /
    • pp.1-26
    • /
    • 1998
  • The loop is a fundamental for the parallelism exploiting as it has a large portion of execution time for sequential Java program on the parallel machine. This paper proposes the method of exploiting the implicit parallelism through the analysis of data dependence in the existing Java programming language having a single loop structure. The parallel code generation method through the restructuring compiler and the translation method of Java source program into multithread statement, which is supported in the level of the Java programming language, are also proposed here. The performance test of the program translated into the thread statement is conducted using the trip count of loop and the thread count as parameters. The restructuring compiler makes it possible for users to reduce overhead and exploit parallelism efficiently in the Java programming.

  • PDF

OpenGL ES 2.0 based Shader Compilation Method for the Instruction-Level Parallelism (OpenGL ES 2.0 기반 셰이더 명령어 병렬 처리를 위한 컴파일 기법)

  • Kim, Jong-Ho;Kim, Tae-Young
    • Journal of Korea Game Society
    • /
    • v.8 no.2
    • /
    • pp.69-76
    • /
    • 2008
  • In this paper, we present the architecture of graphics processor and its instruction format for the mobile device. In addition, we introduce tile shader data structure for the on/off-line compilation based on the OpenGL ES 2.0 and a new optimization method based on the ILP(Instruction-Level Parallelism). This paper shows where a processor with the sane core clock is being used, the shader instruction resulted from the compile structure and method in this paper is approximately 1.5 to 2 times faster than a code based on the single instruction.

  • PDF

Performance Evaluation of Value Predictor in High Performance Microprocessors (고성능 마이크로프로세서에서 값 예측기의 성능평가)

  • Jeon Byoung-Chan;Kim Hyeock-Jin;RU Dae-Hee
    • Journal of the Korea Society of Computer and Information
    • /
    • v.10 no.2 s.34
    • /
    • pp.87-95
    • /
    • 2005
  • value prediction in high performance micro processors is a technique that exploits Instruction Level Parallelism(ILP) by predicting the outcome of an instruction and by breaking and executing true data dependences. In this paper, the mean Performance improvements by predictor according to a point of time for update of each table as well as prediction accuracy and Prediction rate are measured and assessed by comparison and analysis of value predictor that issues in parallel and run by predicting value, which is for Performance improvements of ILP in micro Processor. For the verification of its validity the SPECint95 benchmark through the simulation is compared by making use of execution driven system.

  • PDF

GPU-Based Acceleration of Quantum-Inspired Evolutionary Algorithm (GPU를 이용한 Quantum-Inspired Evolutionary Algorithm 가속)

  • Ryoo, Ji-Hyun;Park, Han-Min;Choi, Ki-Young
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.49 no.8
    • /
    • pp.1-9
    • /
    • 2012
  • Quantum-Inspired Evolutionary Algorithm(QEA) contains sufficient data-level parallelism to be naturally accelerated on GPUs. For an efficient reduction of execution time, however, careful task-mapping should be done to properly reflect the characteristics of CPU and GPU. Furthermore, when deciding which part of the application should run on GPU, we need to consider the data transfer between CPU and GPU memory spaces as well as the data-level parallelism. In addition, the usage of zero-copy host memory, proper choice of the execution configuration, and thread organization considering memory coalescing is important to further reduce the execution time. With all these techniques, we could run QEA 3.69 times faster on average in comparison with the multi-threading CPU for the case of 0-1 knapsack problem with 30,000 items.

Implementation and Verification of a Multi-Core Processor including Multimedia Specific Instructions (멀티미디어 전용 명령어를 내장한 멀티코어 프로세서 구현 및 검증)

  • Seo, Jun-Sang;Kim, Jong-Myon
    • IEMEK Journal of Embedded Systems and Applications
    • /
    • v.8 no.1
    • /
    • pp.17-24
    • /
    • 2013
  • In this paper, we present a multi-core processor including multimedia specific instructions to process multimedia data efficiently in the mobile environment. Multimedia specific instructions exploit subword level parallelism (SLP), while the multi-core processor exploits data level parallelism (DLP). These combined parallelisms improve the performance of multimedia processing applications. The proposed multi-core processor including multimedia specific instructions is implemented and tested using a Xilinx ISE 10.1 tool and SoCMaster3 testbed system including Vertex 4 FPGA. Experimental results using a fire detection algorithm show that multimedia specific instructions outperform baseline instructions in the same multi-core architecture in terms of performance (1.2x better), energy efficiency (1.37x better), and area efficiency (1.23x better).