• Title/Summary/Keyword: memory bank conflict


Scalable Application Mapping for SIMD Reconfigurable Architecture

  • Kim, Yongjoo; Lee, Jongeun; Lee, Jinyong; Paek, Yunheung
    • JSTS: Journal of Semiconductor Technology and Science, v.15 no.6, pp.634-646, 2015
  • Coarse-Grained Reconfigurable Architecture (CGRA) is a very promising platform that provides fast turnaround time as well as very high energy efficiency for multimedia applications. One of the problems with CGRAs, however, is application mapping, which currently does not scale well with geometrically increasing numbers of cores. To mitigate the scalability problem, this paper discusses how to use the SIMD (Single Instruction Multiple Data) paradigm for CGRAs. While the idea of SIMD is not new, SIMD can complicate the mapping problem by adding an additional dimension of iteration mapping to the already complex and interdependent problems of operation and data mapping, and can thus significantly affect performance through memory bank conflicts. In this paper, based on a new architecture called SIMD reconfigurable architecture, which allows SIMD execution at multiple levels of granularity, we present how to minimize bank conflicts by considering all three related sub-problems for various RA organizations. We also present data tiling and evaluate a conflict-free scheduling algorithm as a way to eliminate bank conflicts for a certain class of mapping problems.
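
The bank-conflict issue described in this abstract can be made concrete with a small sketch. The following C example is illustrative only and not taken from the paper; NUM_BANKS, LANES, and the word-interleaved bank mapping are assumptions. It shows how the addresses generated by SIMD lanes map to banks, and how padding a tile row, one simple form of data tiling, spreads the accesses across banks.

    /* Minimal sketch (not from the paper): how per-lane addresses map to
     * memory banks, and how padding a tile row can remove conflicts.
     * NUM_BANKS, LANES and the word-interleaved mapping are assumptions. */
    #include <stdio.h>

    #define NUM_BANKS 8
    #define LANES     8

    /* Word-interleaved mapping: bank = word_index mod NUM_BANKS (assumed). */
    static int bank_of(int word_index) { return word_index % NUM_BANKS; }

    /* Worst-case number of lanes hitting one bank in a single SIMD access to
     * column `col` of a row-major array with `stride` words per row.
     * A value > 1 means the access serializes (bank conflict). */
    static int max_conflicts(int stride, int col) {
        int hits[NUM_BANKS] = {0}, worst = 0;
        for (int lane = 0; lane < LANES; ++lane) {
            int b = bank_of(lane * stride + col);   /* lane i reads row i */
            if (++hits[b] > worst) worst = hits[b];
        }
        return worst;
    }

    int main(void) {
        /* stride 8: every lane maps to the same bank -> 8-way conflict. */
        printf("stride 8 : %d lanes on one bank\n", max_conflicts(8, 0));
        /* stride 9 (one padding word per row): accesses spread over banks. */
        printf("stride 9 : %d lanes on one bank\n", max_conflicts(9, 0));
        return 0;
    }

With the padded stride, the eight simultaneous lane accesses fall into eight different banks instead of serializing on one.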

Design and Performance Analysis of High Performance Processor-Memory Integrated Architectures (고성능 프로세서-메모리 혼합 구조의 설계 및 성능 분석)

  • Kim, Young-Sik; Kim, Shin-Dug; Han, Tack-Don
    • The Transactions of the Korea Information Processing Society, v.5 no.10, pp.2686-2703, 1998
  • The widening performance gap between processor and memory has led to the emergence of a promising architecture: processor-memory (P-M) integration. In this paper, various design issues for P-M integration are studied. First, an analytical model of the DRAM access time is constructed that considers both the bank conflict ratio and the DRAM page hit ratio. The points of both performance improvement and performance bottleneck when designing on-chip DRAM architectures are then identified with the proposed model. This paper proposes a new architecture, called the delayed precharge bank architecture, which improves memory system performance by increasing the DRAM page hit ratio. An efficient bank interleaving mechanism is also adapted to the proposed architecture. Execution-driven simulation verifies that this architecture performs better than both the hierarchical multi-bank architecture and the conventional bank architecture. Eight SPEC95 benchmarks are used for simulation while varying parameters for the cache architecture, the number of DRAM banks, and the delayed-time quantum.
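
As a rough illustration of the kind of analytical access-time model this abstract describes, the following C sketch combines a page-hit ratio and a bank-conflict ratio into an average DRAM access time. The formula and all timing constants are assumptions made for illustration; the paper's exact model may differ.

    /* Minimal sketch (not the paper's exact model): average DRAM access time
     * estimated from a page-hit ratio and a bank-conflict ratio.
     * All timing values and the formula itself are illustrative assumptions. */
    #include <stdio.h>

    /* page_hit : fraction of accesses that hit an open DRAM page
     * conflict : fraction of accesses stalled behind a busy (conflicting) bank */
    static double avg_access_ns(double page_hit, double conflict)
    {
        const double t_hit      = 20.0;  /* open-page access (assumed, ns)   */
        const double t_miss     = 50.0;  /* precharge + activate + access    */
        const double t_conflict = 30.0;  /* extra stall behind a busy bank   */

        return page_hit * t_hit
             + (1.0 - page_hit) * t_miss
             + conflict * t_conflict;
    }

    int main(void)
    {
        /* Delayed precharge aims to raise the page-hit ratio; compare. */
        printf("hit=0.4 conflict=0.1 -> %.1f ns\n", avg_access_ns(0.4, 0.1));
        printf("hit=0.7 conflict=0.1 -> %.1f ns\n", avg_access_ns(0.7, 0.1));
        return 0;
    }

Raising the page-hit ratio, which is what the delayed precharge scheme targets, directly lowers the modeled average access time.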


Data Cache System based on the Selective Bank Algorithm for Embedded System (내장형 시스템을 위한 선택적 뱅크 알고리즘을 이용한 데이터 캐쉬 시스템)

  • Jung, Bo-Sung; Lee, Jung-Hoon
    • The KIPS Transactions: Part A, v.16A no.2, pp.69-78, 2009
  • One of the most effective ways to improve cache performance is to exploit both the temporal and spatial locality exhibited by a program's execution characteristics. In this paper we present a high-performance, low-power cache structure with a bank selection mechanism that enhances exploitation of spatial and temporal locality. The proposed cache system consists of two parts: a main direct-mapped cache with a small block size and a fully associative buffer with a large block size that is a multiple of the small block size. In particular, the main direct-mapped cache is constructed as two banks for low power consumption and stores small blocks selected from the fully associative buffer by the proposed bank selection algorithm. Using the bank selection algorithm and three state bits, we selectively extend the lifetime of small blocks with high temporal locality by storing them in the main direct-mapped cache. This approach effectively reduces conflict misses and cache pollution at the same time. According to the simulation results, the average miss ratio, compared with Victim and STAS caches of the same size, is improved by about 23% and 32%, respectively, for MiBench applications. The average memory access time is reduced by about 14% and 18% compared with the Victim and STAS caches, respectively. Energy consumption of the proposed cache is also around 10% lower than the other cache systems we examined.
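
The abstract does not spell out the bank selection algorithm itself, so the following C sketch is only an assumed illustration of the general idea: a sub-block in the fully associative buffer is promoted into one of the two direct-mapped banks once a small reference counter (standing in here for the paper's three state bits) suggests high temporal locality. The threshold, counter width, and bank-choice rule are all hypothetical.

    /* Minimal sketch (an assumption, not the paper's exact algorithm):
     * promote a small sub-block from the fully associative buffer into one
     * of two direct-mapped banks once its reference count signals reuse. */
    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    #define SMALL_BLOCKS_PER_LARGE 4
    #define PROMOTE_THRESHOLD      2      /* assumed threshold */

    typedef struct {
        uint8_t ref_count[SMALL_BLOCKS_PER_LARGE];  /* stand-in for 3 state bits */
    } buffer_entry_t;

    /* Called on a hit in the fully associative buffer. Returns true and picks
     * a destination bank when the referenced sub-block should be promoted. */
    static bool on_buffer_hit(buffer_entry_t *e, int sub, uint32_t set_index,
                              int *bank_out)
    {
        if (e->ref_count[sub] < 7)
            e->ref_count[sub]++;
        if (e->ref_count[sub] >= PROMOTE_THRESHOLD) {
            *bank_out = (int)(set_index & 1);   /* illustrative bank-choice rule */
            return true;                        /* caller copies block into bank */
        }
        return false;
    }

    int main(void)
    {
        buffer_entry_t e = {{0}};
        int bank;
        for (int hit = 1; hit <= 3; ++hit)
            printf("hit %d: promote=%d\n", hit, on_buffer_hit(&e, 0, 5, &bank));
        return 0;
    }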

Parallel Implementations of Digital Focus Indices Based on Minimax Search Using Multi-Core Processors

  • Kim, HyungTae; Lee, Duk-Yeon; Choi, Dongwoon; Kang, Jaehyeon; Lee, Dong-Wook
    • KSII Transactions on Internet and Information Systems (TIIS), v.17 no.2, pp.542-558, 2023
  • A digital focus index (DFI) is a value used to determine image focus in scientific apparatus and smart devices. Automatic focus (AF) is an iterative and time-consuming procedure; however, its processing time can be reduced using a graphics processing unit (GPU) and a multi-core processor (MCP). In this study, parallel architectures of a minimax search algorithm (MSA) are applied to two DFIs: the range algorithm (RA) and image contrast (CT). These DFIs are based on a histogram; however, parallel computation of a histogram is conventionally inefficient because of bank conflicts in shared memory. The parallel architectures of RA and CT are constructed using parallel reduction for the MSA, which compares image pixel pairs in parallel and halves the array in every step. The array size is eventually reduced to one, and the minimax is determined at the final reduction step. Kernels for the architectures are constructed using open-source software to make them relatively platform independent. The kernels are tested on a hexa-core PC and an embedded device using Lenna images of various sizes based on the resolutions of industrial cameras. The performance of the kernels for the DFIs was investigated in terms of processing speed and computational acceleration; the maximum acceleration was 32.6× in the best case, and the MCP exhibited the higher performance.
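
The pairwise halving reduction described in this abstract can be sketched in plain C as below. This is an illustrative reimplementation, not the paper's kernel: the OpenMP pragma only marks the step the paper parallelizes on GPU/MCP hardware, and computing the range DFI as max minus min is an assumption.

    /* Minimal sketch (not the paper's kernel): a pairwise tree reduction that
     * finds the minimum and maximum pixel value by halving the array in every
     * step, then reports an assumed range-style DFI (max - min). */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { unsigned char lo, hi; } minmax_t;

    static void minmax_reduce(const unsigned char *pixels, size_t n, minmax_t *out)
    {
        minmax_t *buf = malloc(n * sizeof *buf);
        for (size_t i = 0; i < n; ++i)       /* each element starts as its own min/max */
            buf[i] = (minmax_t){ pixels[i], pixels[i] };

        for (size_t len = n; len > 1; len = (len + 1) / 2) {
            size_t half = len / 2;
            #pragma omp parallel for          /* pairwise compare, halve the array */
            for (size_t i = 0; i < half; ++i) {
                minmax_t a = buf[i], b = buf[len - 1 - i];
                buf[i].lo = a.lo < b.lo ? a.lo : b.lo;
                buf[i].hi = a.hi > b.hi ? a.hi : b.hi;
            }
        }
        *out = buf[0];                        /* final reduction: array size is one */
        free(buf);
    }

    int main(void)
    {
        unsigned char img[] = { 12, 200, 7, 99, 243, 55, 18, 140 };
        minmax_t mm;
        minmax_reduce(img, sizeof img / sizeof img[0], &mm);
        printf("min=%u max=%u -> range DFI = %u\n", mm.lo, mm.hi, mm.hi - mm.lo);
        return 0;
    }

Each halving step is data-parallel and touches disjoint array slots, which is what lets the reduction map onto GPU work-groups or MCP threads without shared-memory bank conflicts.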