• Title/Summary/Keyword: data level parallelism


An On-chip Multiprocessor Microprocessor with Shared MMU and Cache

  • Lee, Yong-Hwan;Jeong, Woo-Kyeong;An, Sang-Jun;Lee, Yong-Surk
    • Journal of Electrical Engineering and Information Science / v.2 no.4 / pp.1-7 / 1997
  • A multiprocessor microprocessor named SMPC (scalable multiprocessor chip) that contains two IUs (integer units) is presented in this paper. It can execute multiple instructions from several tasks by exploiting task-level parallelism, which is free from instruction dependencies, and provides high performance and throughput in both single-program and multiprogramming environments. The IU is a 32-bit scalar processor especially designed to boost the performance of the string manipulations that are frequently used in RDBMS (relational database management system) applications. A memory management unit and a data cache shared by the two IUs improve performance and reduce the required chip area. The SMPC is implemented as a VLSI circuit using custom design and automated design tools.
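
  As a rough illustration of the task-level parallelism the SMPC targets, the sketch below runs two independent tasks concurrently with POSIX threads; the task bodies (a string scan and an integer sum) are hypothetical stand-ins, not the SMPC's actual workload or programming model.

      /* Two independent tasks with no instruction dependencies between them
       * run concurrently, one per core.  Task bodies are illustrative only. */
      #include <pthread.h>
      #include <stdio.h>

      static void *count_vowels(void *arg)      /* string-manipulation style task */
      {
          const char *s = arg;
          long n = 0;
          for (; *s; ++s)
              if (*s == 'a' || *s == 'e' || *s == 'i' || *s == 'o' || *s == 'u')
                  ++n;
          return (void *)n;
      }

      static void *sum_ints(void *arg)          /* unrelated integer task */
      {
          long limit = *(long *)arg, sum = 0;
          for (long i = 1; i <= limit; ++i)
              sum += i;
          return (void *)sum;
      }

      int main(void)
      {
          pthread_t t1, t2;
          long limit = 1000;
          void *r1, *r2;

          pthread_create(&t1, NULL, count_vowels, (void *)"select name from employee");
          pthread_create(&t2, NULL, sum_ints, &limit);
          pthread_join(t1, &r1);                /* both tasks progress in parallel */
          pthread_join(t2, &r2);
          printf("vowels=%ld sum=%ld\n", (long)r1, (long)r2);
          return 0;
      }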


Implementation of Multi-Core Processor for Beamforming Algorithm of Mobile Ultrasound Image Signals (모바일 초음파 영상신호의 빔포밍 알고리즘을 위한 멀티코어 프로세서 구현)

  • Choi, Byong-Kook;Kim, Jong-Myon
    • The KIPS Transactions: Part A / v.18A no.2 / pp.45-52 / 2011
  • In the past, a patient had to visit the room where an ultrasound imaging device was installed in order to be examined by a doctor; today, a doctor can instead bring a handheld ultrasound device to the patient. Such handheld devices, however, have so far implemented only basic functions and cannot meet the high performance required by the ultrasound beam focusing algorithm that determines image quality, and a mobile ultrasound device must also keep energy consumption low. To satisfy these requirements, this paper proposes a high-performance, low-power single instruction, multiple data (SIMD) based multi-core processor that supports beamforming, a representative focusing method for mobile ultrasound image signals. The proposed SIMD multi-core processor, which consists of 16 processing elements (PEs), meets the performance required by the beamforming algorithm by exploiting the considerable data-level parallelism inherent in ultrasound echo data. Experimental results show that the proposed multi-core processor outperforms a commercial high-performance processor, the TI C6416 DSP, in execution time (15.8 times better), energy efficiency (6.9 times better), and area efficiency (10 times better).
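
  The data-level parallelism described above can be pictured with a plain delay-and-sum receive beamformer: each output sample sums per-channel echo samples shifted by a focusing delay, and the channel loop has no cross-iteration dependency, so a 16-PE SIMD array can process one channel per PE. The C sketch below is a scalar illustration with made-up sizes and delays, not the paper's implementation.

      /* Minimal delay-and-sum receive beamforming sketch.  The channel loop is
       * the data-parallel dimension a SIMD machine would split across PEs. */
      #include <stdio.h>

      #define CHANNELS 16          /* one per PE in the paper's configuration */
      #define SAMPLES  256

      int main(void)
      {
          static float echo[CHANNELS][SAMPLES]; /* per-channel echo data */
          static int   delay[CHANNELS];         /* per-channel focusing delay */
          static float beam[SAMPLES];           /* beamformed output line */

          for (int c = 0; c < CHANNELS; ++c) {  /* synthetic input data */
              delay[c] = c / 4;                 /* toy delay profile */
              for (int s = 0; s < SAMPLES; ++s)
                  echo[c][s] = (float)((s + c) % 7);
          }

          for (int s = 0; s < SAMPLES; ++s) {
              float sum = 0.0f;
              /* data-parallel dimension: each PE handles one channel */
              for (int c = 0; c < CHANNELS; ++c) {
                  int idx = s - delay[c];
                  if (idx >= 0)
                      sum += echo[c][idx];
              }
              beam[s] = sum;                    /* reduction across channels/PEs */
          }

          printf("beam[100] = %f\n", beam[100]);
          return 0;
      }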

A Hybrid Value Predictor using Static and Dynamic Classification in Superscalar Processors (슈퍼스칼라 프로세서에서 정적 및 동적 분류를 사용한 혼합형 결과 값 예측기)

  • 김주익;박홍준;조영일
    • Journal of KIISE: Computer Systems and Theory / v.30 no.10 / pp.569-578 / 2003
  • Data dependencies are one of the major hurdles that limit ILP (instruction-level parallelism), and several related works have shown that this limit can be overcome to some extent by data value prediction. A hybrid value predictor can achieve high prediction accuracy by combining the advantages of several predictors, but it has the drawback that the same instruction occupies overlapping entries in every component predictor. In this paper, we propose a new hybrid value predictor that achieves high performance by using both static and dynamic classification information. The proposed predictor improves prediction accuracy and efficiently reduces the size of the prediction tables, because static classification information allocates each instruction to the single best-suited predictor during the fetch stage. It further improves accuracy by selecting the best-suited prediction method for instructions with an "unknown" pattern through a dynamic classification mechanism. Simulation results based on the SimpleScalar/PISA tool set and the SPECint95 benchmarks show an average correct prediction rate of 85.1% with static classification alone, and 87.6% when static and dynamic classification are combined.
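
  A minimal sketch of the idea, assuming simplified component predictors (last-value and stride) and a toy 2-bit chooser: statically classified instructions index only their own table, while "unknown" instructions fall back to the dynamic chooser. Table sizes, the classification source, and recovery details are illustrative, not the paper's design.

      /* Toy hybrid value predictor: each static instruction carries a class
       * (LAST_VALUE, STRIDE, or UNKNOWN).  Classified instructions occupy only
       * one predictor table; UNKNOWN ones use a dynamic 2-bit chooser. */
      #include <stdint.h>
      #include <stdio.h>

      #define ENTRIES 1024

      typedef enum { LAST_VALUE, STRIDE, UNKNOWN } val_class_t;

      typedef struct { uint64_t last; } lv_entry_t;
      typedef struct { uint64_t last; int64_t stride; } st_entry_t;

      static lv_entry_t lv_tab[ENTRIES];
      static st_entry_t st_tab[ENTRIES];
      static uint8_t    chooser[ENTRIES];   /* 2-bit counter: >=2 prefers stride */

      static uint64_t predict(uint32_t pc, val_class_t cls)
      {
          uint32_t i = (pc >> 2) % ENTRIES;
          switch (cls) {
          case LAST_VALUE: return lv_tab[i].last;
          case STRIDE:     return st_tab[i].last + st_tab[i].stride;
          default:         return (chooser[i] >= 2)
                                  ? st_tab[i].last + st_tab[i].stride
                                  : lv_tab[i].last;
          }
      }

      static void update(uint32_t pc, val_class_t cls, uint64_t actual)
      {
          uint32_t i = (pc >> 2) % ENTRIES;
          if (cls == UNKNOWN) {                /* train the dynamic chooser */
              uint64_t lv_pred = lv_tab[i].last;
              uint64_t st_pred = st_tab[i].last + st_tab[i].stride;
              if (st_pred == actual && lv_pred != actual && chooser[i] < 3)
                  ++chooser[i];
              else if (lv_pred == actual && st_pred != actual && chooser[i] > 0)
                  --chooser[i];
          }
          if (cls != LAST_VALUE) {             /* instruction lives in stride table */
              st_tab[i].stride = (int64_t)(actual - st_tab[i].last);
              st_tab[i].last   = actual;
          }
          if (cls != STRIDE)                   /* instruction lives in last-value table */
              lv_tab[i].last = actual;
      }

      int main(void)
      {
          /* A strided value sequence: static classification would mark this
           * instruction STRIDE and skip the other tables entirely. */
          uint32_t pc = 0x400100;
          for (uint64_t v = 0; v < 40; v += 4) {
              uint64_t guess = predict(pc, STRIDE);
              printf("predicted %llu actual %llu\n",
                     (unsigned long long)guess, (unsigned long long)v);
              update(pc, STRIDE, v);
          }
          return 0;
      }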

Trend Comparison of Repeated Measures Data between Two Groups (반복측정 자료에서 개체기울기를 이용한 집단간의 차이 검정법)

  • Hwang, Kum-Na;Kim, Dong-Jae
    • The Korean Journal of Applied Statistics / v.19 no.3 / pp.565-578 / 2006
  • Repeated measurement data from two groups are common in medical research. In this paper, we suggest a method for comparing the trend between two groups based on repeated measurement data. First, we estimate the regression coefficient of a linear regression model for each subject and use the estimated coefficients as samples. We then test the difference between the two groups by applying the unpaired t-test, the Wilcoxon rank sum test, and the placement test to these samples. A Monte Carlo simulation is used to examine the power and empirical significance levels of the methods under various settings.
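
  A minimal sketch of the suggested procedure under synthetic data: fit a least-squares slope per subject, then compare the two groups of slopes with an unpaired (Welch) t statistic. The Wilcoxon rank sum and placement variants, and the simulation study, are omitted.

      /* Per-subject slope over the repeated time points, then an unpaired
       * (Welch) t statistic on the two groups of slopes.  Data are synthetic. */
      #include <math.h>
      #include <stdio.h>

      #define TIMES 5

      static double slope(const double y[TIMES])
      {
          double tbar = (TIMES - 1) / 2.0, ybar = 0, num = 0, den = 0;
          for (int t = 0; t < TIMES; ++t) ybar += y[t] / TIMES;
          for (int t = 0; t < TIMES; ++t) {
              num += (t - tbar) * (y[t] - ybar);
              den += (t - tbar) * (t - tbar);
          }
          return num / den;                       /* least-squares slope */
      }

      static void mean_var(const double *x, int n, double *m, double *v)
      {
          double s = 0, ss = 0;
          for (int i = 0; i < n; ++i) s += x[i];
          *m = s / n;
          for (int i = 0; i < n; ++i) ss += (x[i] - *m) * (x[i] - *m);
          *v = ss / (n - 1);
      }

      int main(void)
      {
          /* two groups of 4 subjects, 5 repeated measurements each (synthetic) */
          double g1[4][TIMES] = {{1,2,3,4,5},{2,3,5,6,8},{0,1,3,3,5},{1,3,4,6,7}};
          double g2[4][TIMES] = {{5,5,4,4,3},{6,5,5,4,4},{4,4,3,2,2},{5,4,4,3,2}};
          double s1[4], s2[4], m1, v1, m2, v2;

          for (int i = 0; i < 4; ++i) { s1[i] = slope(g1[i]); s2[i] = slope(g2[i]); }
          mean_var(s1, 4, &m1, &v1);
          mean_var(s2, 4, &m2, &v2);

          /* Welch's unpaired t statistic on the per-subject slopes */
          double t = (m1 - m2) / sqrt(v1 / 4 + v2 / 4);
          printf("group slopes: %.3f vs %.3f, t = %.3f\n", m1, m2, t);
          return 0;
      }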

Parallel Method for HEVC Deblocking Filter based on Coding Unit Depth Information (코딩 유닛 깊이 정보를 이용한 HEVC 디블록킹 필터의 병렬화 기법)

  • Jo, Hyun-Ho;Ryu, Eun-Kyung;Nam, Jung-Hak;Sim, Dong-Gyu;Kim, Doo-Hyun;Song, Joon-Ho
    • Journal of Broadcast Engineering / v.17 no.5 / pp.742-755 / 2012
  • In this paper, we propose a parallel deblocking algorithm that resolves the workload imbalance arising when the deblocking filter of a high efficiency video coding (HEVC) decoder is parallelized. In HEVC, the deblocking filter, one of the in-loop filters, performs two-step filtering: vertical edges first, then horizontal edges. Deblocking filtering can run at high speed through data-level parallelism because there is no dependency between adjacent edges. However, workloads can be imbalanced across regions even when each region is allocated the same amount of data, which degrades the performance of decoder parallelization. We solve this imbalance by predicting the complexity of deblocking filtering from the coding unit (CU) depth information of each coding tree block (CTB) and allocating an equal amount of estimated workload to each core. Experimental results show that the proposed method achieves an average time saving (ATS) of 64.3% compared to single-core deblocking filtering, and an ATS of 6.7% on average and 13.5% at maximum compared to conventional uniform data-level parallelism.
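
  The load-balancing step can be sketched as follows: estimate each CTB's deblocking cost from its CU-depth information, then cut the CTB sequence so that every core receives roughly the same estimated cost. The cost model (a per-CTB depth-derived weight) and the sizes below are illustrative; the edge filtering itself is omitted.

      /* Depth-weighted load balancing for parallel deblocking: split the CTB
       * sequence so each core gets about the same estimated filtering cost. */
      #include <stdio.h>

      #define NUM_CTB  32
      #define NUM_CORE 4

      int main(void)
      {
          int depth_cost[NUM_CTB];      /* per-CTB estimated workload */
          int start[NUM_CORE + 1];      /* CTB ranges assigned to each core */
          long total = 0;

          for (int i = 0; i < NUM_CTB; ++i) {      /* toy depth-derived costs */
              depth_cost[i] = 1 + (i * 7) % 5;     /* deeper CU tree -> more edges */
              total += depth_cost[i];
          }

          /* greedy split: advance each boundary once the running cost reaches
           * the next multiple of total / NUM_CORE */
          start[0] = 0;
          long running = 0;
          int core = 1;
          for (int i = 0; i < NUM_CTB && core < NUM_CORE; ++i) {
              running += depth_cost[i];
              if (running * NUM_CORE >= total * core)
                  start[core++] = i + 1;
          }
          while (core <= NUM_CORE) start[core++] = NUM_CTB;

          for (int c = 0; c < NUM_CORE; ++c)
              printf("core %d filters CTBs [%d, %d)\n", c, start[c], start[c + 1]);
          return 0;
      }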

Design of a Parallel Pipelined Processor Architecture (병렬 파이프라인 프로세서 아키덱처의 설계)

  • 이상정;김광준
    • Journal of the Korean Institute of Telematics and Electronics B / v.32B no.3 / pp.11-23 / 1995
  • In this paper, we propose a parallel pipelined processor model that acts as a small VLIW processor architecture, together with a scheduling algorithm for extracting instruction-level parallelism on this architecture. The proposed model has a dual-instruction mode in which a maximum of four basic operations are executed in parallel. By combining these basic operations, a variable instruction set can be designed for various applications. The scheduling algorithm schedules basic operations for parallel execution and removes pipeline hazards by examining data dependency and resource conflict relations. To verify the operation and evaluate the performance, a C compiler and a simulator were developed. By running various test programs through the compiler and simulator, the characteristics and performance of the proposed architecture are measured.
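
  A minimal list-scheduling sketch in the spirit of the proposed algorithm: in each long-instruction word, pack operations whose source registers are already available and whose functional unit is still free. The operation set, register model, and dependence test are simplified stand-ins for the paper's instruction design.

      /* Greedy packing of up to SLOTS basic operations per word, respecting
       * data dependencies and functional-unit (resource) conflicts. */
      #include <stdio.h>

      #define SLOTS 4
      #define N_OPS 6

      typedef struct { const char *text; int src1, src2, dst; int unit; } op_t;

      int main(void)
      {
          /* toy dataflow graph; unit 0 = ALU, 1 = multiplier, 2 = load */
          op_t ops[N_OPS] = {
              {"add r3,r1,r2", 1, 2, 3, 0}, {"mul r4,r3,r1", 3, 1, 4, 1},
              {"sub r5,r1,r2", 1, 2, 5, 0}, {"ld  r6,[r2]",  2, 2, 6, 2},
              {"add r7,r4,r5", 4, 5, 7, 0}, {"add r8,r6,r1", 6, 1, 8, 0},
          };
          int ready[16] = {0};              /* which registers hold valid results */
          int done[N_OPS] = {0}, left = N_OPS, cycle = 0;
          ready[1] = ready[2] = 1;          /* live-in registers */

          while (left > 0) {
              int unit_busy[3] = {0}, issued[N_OPS] = {0}, n = 0;
              printf("word %d:", cycle++);
              for (int i = 0; i < N_OPS && n < SLOTS; ++i) {
                  if (done[i] || !ready[ops[i].src1] || !ready[ops[i].src2]
                      || unit_busy[ops[i].unit])
                      continue;             /* data dependency or resource conflict */
                  unit_busy[ops[i].unit] = 1;
                  issued[i] = 1;
                  ++n;
                  printf("  [%s]", ops[i].text);
              }
              for (int i = 0; i < N_OPS; ++i)   /* results become visible next word */
                  if (issued[i]) { done[i] = 1; ready[ops[i].dst] = 1; --left; }
              printf("\n");
          }
          return 0;
      }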


Workload Characteristics-based L1 Data Cache Switching-off Mechanism for GPUs

  • Do, Thuan Cong;Kim, Gwang Bok;Kim, Cheol Hong
    • Journal of the Korea Society of Computer and Information / v.23 no.10 / pp.1-9 / 2018
  • Modern graphics processing units (GPUs) have become one of the most attractive platforms for exploiting high thread-level parallelism with the support of programming tools such as CUDA and OpenCL. Recent GPUs employ a cache hierarchy to support irregular memory access patterns; however, the L1 data cache (L1D) often exhibits poor efficiency on the GPU. This paper shows that the L1D does not always benefit applications in terms of performance and energy efficiency; for many applications, GPU performance is actually harmed by using the L1D. Our proposed technique exploits the characteristics of the currently executing application to predict the performance impact of the L1D on the GPU and then decides whether the cache should remain enabled for that application. Experimental results show that the proposed technique improves GPU performance by 9.4% and saves up to 52.1% of the power consumed in the L1D.
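
  A hedged sketch of the decision step: sample cache counters during a monitoring window and bypass the L1D when the observed behavior suggests it will not help. The counters, thresholds, and bypass hook below are hypothetical illustrations, not the paper's exact metrics or mechanism.

      /* Workload-driven L1D on/off decision: after a sampling window, keep the
       * L1D only if the observed hit rate and contention look favorable. */
      #include <stdbool.h>
      #include <stdio.h>

      struct l1d_stats {
          unsigned long accesses;
          unsigned long hits;
          unsigned long reservation_fails;   /* stalls from cache-line contention */
      };

      /* Decide whether to keep using the L1D after the sampling window. */
      static bool keep_l1d(const struct l1d_stats *s)
      {
          if (s->accesses == 0)
              return true;                               /* nothing observed yet */
          double hit_rate  = (double)s->hits / s->accesses;
          double fail_rate = (double)s->reservation_fails / s->accesses;
          return hit_rate > 0.25 && fail_rate < 0.50;    /* illustrative thresholds */
      }

      int main(void)
      {
          struct l1d_stats streaming = { 100000, 4000, 70000 };  /* cache-unfriendly */
          struct l1d_stats reuse     = { 100000, 62000, 3000 };  /* cache-friendly  */

          printf("streaming kernel: %s L1D\n", keep_l1d(&streaming) ? "keep" : "bypass");
          printf("reuse kernel:     %s L1D\n", keep_l1d(&reuse)     ? "keep" : "bypass");
          return 0;
      }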

Performance Analysis of Value Predictor considering instruction issue width in Superscalar processor (슈퍼스칼라 프로세서에서 명령어 이슈 길이를 고려한 값 예측기의 성능분석)

  • Jean Byoung-Chan;Kim Hyeock-Jin
    • Journal of the Korea Computer Industry Society / v.7 no.3 / pp.171-178 / 2006
  • Value prediction in a superscalar processor is a technique that obtains performance gains by supplying the predicted value of an instruction to its data-dependent instructions earlier, with the goal of improving ILP. In this paper, the prediction accuracy, prediction rate, and mean performance improvement of value predictors are measured and assessed by comparing and analyzing predictors at instruction issue widths of 4, 8, and 16. The experimental results show the superior performance of the 8-issue configuration.


AS B-tree: A study on the enhancement of the insertion performance of B-tree on SSD (AS B-트리: SSD를 사용한 B-트리에서 삽입 성능 향상에 관한 연구)

  • Kim, Sung-Ho;Roh, Hong-Chan;Lee, Dae-Wook;Park, Sang-Hyun
    • The KIPS Transactions: Part D / v.18D no.3 / pp.157-168 / 2011
  • Recently, flash memory has been utilized as the main storage device in mobile devices, and flash SSDs are gaining popularity as a major storage device in laptop and desktop computers, and even in enterprise-level servers. Unlike on HDDs, an overwrite operation on flash memory cannot be performed unless it is preceded by an erase operation on the same block. To address this, an FTL (flash memory translation layer) is employed. Even when a modified data block is logically overwritten at the same address, the FTL writes the updated block to a physical address different from the previous one and maps the logical address to the new physical address, which allows flash memory to avoid the high block-erase cost. A flash SSD contains an array of NAND flash memory packages, so it can access one or more packages in parallel at once. To take advantage of this internal parallelism, it is beneficial for a DBMS to issue I/O requests to sequential logical addresses. However, the B-tree, the representative index structure of current relational DBMSs, produces excessive I/O operations at random addresses when its nodes are updated, so the original B-tree is not well suited to SSDs. In this paper, we propose the AS (always sequential) B-tree, which, for every update operation, writes the updated node at the logical address contiguous with the previously written node. In our experiments, the AS B-tree improved the insertion performance of the B-tree by 21%.
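
  The core write path can be sketched as follows: every node update is appended at the next sequential logical page and a node-id-to-page map is redirected, instead of overwriting the node in place. Node contents, splits, and real device I/O are omitted; the names below are illustrative, not the paper's code.

      /* "Always sequential" write path: an updated B-tree node is written to
       * the next sequential logical page and its mapping is redirected. */
      #include <stdio.h>

      #define MAX_NODES 1024

      static long node_page[MAX_NODES];   /* logical page currently holding node */
      static long next_page = 0;          /* sequential append cursor */

      /* Write (or rewrite) node `id`: always to the next sequential page. */
      static long write_node_sequential(int id)
      {
          long page = next_page++;        /* contiguous with the previous write */
          node_page[id] = page;           /* old page becomes stale, reclaimed later */
          /* ... serialize the node image and issue the write to `page` ... */
          return page;
      }

      int main(void)
      {
          /* An insertion that touches a leaf and its two ancestors produces three
           * writes to consecutive logical pages rather than three random writes. */
          int path[3] = { 17, 4, 0 };       /* leaf, parent, root (toy ids) */
          for (int i = 0; i < 3; ++i)
              printf("node %d rewritten at logical page %ld\n",
                     path[i], write_node_sequential(path[i]));
          return 0;
      }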

A Hybrid Value Predictor using Speculative Update of the Predictor Table and Static Classification for the Pattern of Executed Instructions in Superscalar Processors (슈퍼스칼라 프로세서에서 예상 테이블의 모험적 갱신과 명령어 실행 유형의 정적 분류를 이용한 혼합형 결과값 예측기)

  • Park, Hong-Jun;Jo, Young-Il
    • Journal of KIISE: Computing Practices and Letters / v.8 no.1 / pp.107-115 / 2002
  • We propose a new hybrid value predictor that achieves high performance by combining several predictors. Because the proposed hybrid value predictor can update its prediction table speculatively, it efficiently reduces the number of instructions mispredicted due to stale data. The predictor also improves prediction accuracy and efficiently reduces hardware cost, because it allocates each instruction to the best-suited predictor during the instruction fetch stage using static classification information obtained from a profile-based compiler implementation. For a 16-issue superscalar processor, simulation results based on the SimpleScalar/PISA tool set show an average prediction rate of 73% with speculative update alone, and 88% when static classification is added, for the SPECint95 benchmark programs.
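
  A toy illustration of speculative table update, using a simple stride entry: the entry is advanced with the predicted value as soon as the prediction is made, so closely spaced instances of the same instruction do not read stale history, and the entry is repaired when the prediction resolves as wrong. The table layout and recovery policy are simplified assumptions, not the paper's exact scheme.

      /* Stride predictor with speculative table update and misprediction repair. */
      #include <stdint.h>
      #include <stdio.h>

      #define ENTRIES 256

      typedef struct { uint64_t last; int64_t stride; } entry_t;
      static entry_t tab[ENTRIES];

      static uint64_t predict_and_update(uint32_t pc)
      {
          entry_t *e = &tab[(pc >> 2) % ENTRIES];
          uint64_t pred = e->last + e->stride;
          e->last = pred;                 /* speculative update: don't wait for commit */
          return pred;
      }

      static void resolve(uint32_t pc, uint64_t predicted, uint64_t actual)
      {
          entry_t *e = &tab[(pc >> 2) % ENTRIES];
          if (predicted != actual) {      /* repair the speculatively updated entry */
              e->stride = (int64_t)(actual - (predicted - e->stride));
              e->last   = actual;
          }
      }

      int main(void)
      {
          uint32_t pc = 0x400200;
          uint64_t values[6] = { 8, 16, 24, 32, 100, 108 };   /* stride, then a break */

          for (int i = 0; i < 6; ++i) {
              uint64_t p = predict_and_update(pc);
              printf("predicted %-4llu actual %-4llu %s\n",
                     (unsigned long long)p, (unsigned long long)values[i],
                     p == values[i] ? "hit" : "miss");
              resolve(pc, p, values[i]);
          }
          return 0;
      }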