Search | Korea Science

Chon, Young-Suk
- Journal of KIISE:Computer Systems and Theory
- /
- v.33 no.11
- /
- pp.836-852
- /
- 2006
This paper proposes an L1 cache prefetch scheme using an excessively aggressive hardware prefetcher and a hardware prefetch filter having a small direct-mapped filtering cache. A quantitative analysis method has been introduced and applied to analyze nonideal effects of aggressive cache prefetching. From those analysis results, the structure and algorithm of a prefetch filter has been derived and simulated, and the overall system performance has been measured using a cycle-by-cycle cache simulator. Experimental results show that the proposed scheme improves the overall system performance by 18% on the average over several benchmarks
PDF KSCI

Jeong, Yong Su;Kim, JinHyuk;Cho, Tae Hwan;Choi, SangBang
- Journal of the Institute of Electronics and Information Engineers
- /
- v.52 no.10
- /
- pp.82-94
- /
- 2015
In this paper, we propose hardware prefetch mechanism with an efficient cache replacement policy by giving priority to the trigger block in which a spatial region and producing a spatial region by using the displacement field. It could be taken into account the sequence of the program since a history is based on the trigger block of history record, and it could be quickly prefetching the instructions or data address by adding a stored value to the trigger address and displacement field since a history is stored as a displacement value. Also, we proposed a method of replacing at random by the cache replacement policy from the low priority block when the cache area is full after giving priority to the trigger block. We analyzed using the memory simulator program gem5 and PARSEC benchmark to assess the performance of the hardware prefetcher. As a result, compared to the existing hardware prefecture to generate the spatial region using a bit vector, L1 data cache miss rate was reduced about 44.5% on average and an average of 26.1% of L1 instruction misses occur. In addition, IPC (Instruction Per Cycle) showed an improvement of about 23.7% on average.
https://doi.org/10.5573/ieie.2015.52.10.082 인용 PDF KSCI

Kim, KangHee;Park, TaeShin;Song, KyungHwan;Yoon, DongSung;Choi, SangBang
- Journal of the Institute of Electronics and Information Engineers
- /
- v.53 no.12
- /
- pp.20-35
- /
- 2016
In this paper, in order to reduce the delay and area of the partial product accumulation (PPA) of the parallel decimal multiplier, a tree architecture that composed by multi-operand decimal CSAs and improved CLA is proposed. The proposed tree using multi-operand CSAs reduces the partial product quickly. Since the input range of the recoder of CSA is limited, CSA can get the simplest logic. In addition, using the multi-operand decimal CSAs to add decimal numbers that have limited range in specific locations of the specific architecture can reduce the partial products efficiently. Also, final BCD result can be received faster by improving the logic of the decimal CLA. In order to evaluate the performance of the proposed partial product accumulation, synthesis is implemented by using Design Complier with 180 nm COMS technology library. Synthesis results show the delay of the proposed partial product accumulation is reduced by 15.6% and area is reduced by 16.2% comparing with which uses general method. Also, the total delay and area are still reduced despite the delay and area of the CLA are increased.
https://doi.org/10.5573/ieie.2016.53.12.020 인용 PDF KSCI