• Title/Summary/Keyword: Last Level Cache

Performance evaluation and analysis of TILE-Gx36 many-core processor with PARSEC benchmark (PARSEC을 이용한 TILE-Gx36 다중코어 프로세서의 성능 평가 및 분석)

  • Lee, Boseon;Kim, Han-Yee;Yu, Heonchang;Suh, Taeweon
    • The Journal of Korean Association of Computer Education / v.17 no.1 / pp.107-115 / 2014
  • This paper evaluates and analyzes the performance of TILE-Gx36 (Gx36), a many-core processor. The PARSEC parallel benchmark suite was used to measure performance, and Core i7 (i7) and Atom were used for comparison. When run with the maximum number of threads that each machine can execute concurrently, Gx36 performed 2.73$\times$ worse than Core i7 and 1.93$\times$ better than Atom. Gx36 has the largest Last Level Cache (LLC) among the compared processors. Nevertheless, it reported the largest number of LLC misses, which, we strongly believe, is the major culprit for the lower-than-expected performance. Our study suggests that the DDC employed in Gx36 is not a favorable cache structure for general-purpose high-performance computing. Actual measurements on off-the-shelf machines provide unbiased data for refining future many-core architectures.

A New Cache Replacement Policy for Improving Last Level Cache Performance (라스트 레벨 캐쉬 성능 향상을 위한 캐쉬 교체 기법 연구)

  • Do, Cong Thuan;Son, Dong Oh;Kim, Jong Myon;Kim, Cheol Hong
    • Journal of KIISE / v.41 no.11 / pp.871-877 / 2014
  • Cache replacement algorithms have been developed to reduce miss counts. In modern processors, the performance gap between the processor and main memory keeps increasing, which makes cache replacement policies more important. The Least Recently Used (LRU) policy is one of the most common policies in modern processors. However, recent research has shown that the performance gap between LRU and the theoretical optimal replacement algorithm (OPT) is large. Although LRU replacement has repeatedly proven adequate, the OPT/LRU performance gap widens as cache associativity increases. In this study, we observe that cache performance can be improved on top of existing LRU mechanisms. We propose a method that enhances the LRU replacement algorithm by using the access proportion among the lines in a cache set, measured over the period between two successive replacement actions, to make the final replacement decision. Our experimental results reveal that the proposed method reduces the average miss rate of the baseline 512KB L2 cache by 15 percent compared with conventional LRU. In addition, a processor using the proposed cache replacement policy performs 4.7 percent better than one using LRU, on average.
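
For orientation, the conventional LRU baseline that this abstract compares against can be sketched for a single cache set as below. This is a minimal illustration only, not the paper's proposed policy: the function name `lru_misses`, the access trace, and the way counts are all assumptions made for the example.

```python
from collections import OrderedDict


def lru_misses(accesses, ways):
    """Count misses for one cache set managed by conventional LRU.

    `accesses` is a sequence of line tags; `ways` is the set associativity.
    Both are hypothetical inputs chosen for illustration.
    """
    lines = OrderedDict()  # tags ordered from least to most recently used
    misses = 0
    for tag in accesses:
        if tag in lines:
            lines.move_to_end(tag)  # hit: mark line as most recently used
        else:
            misses += 1
            if len(lines) >= ways:
                lines.popitem(last=False)  # evict the least recently used line
            lines[tag] = True
    return misses


trace = ["A", "B", "C", "A", "D", "B", "A"]  # made-up access trace
print(lru_misses(trace, ways=2))
print(lru_misses(trace, ways=4))
```

With this made-up trace, the 2-way set misses on every access because each reused line is evicted just before it is needed again, which is the kind of behavior that leaves room between LRU and OPT.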

Analysis on the Performance Impact of Partitioned LLC for Heterogeneous Multicore Processors (이종 멀티코어 프로세서에서 분할된 공유 LLC가 성능에 미치는 영향 분석)

  • Moon, Min Goo;Kim, Cheol Hong
    • The Journal of Korean Institute of Next Generation Computing / v.15 no.2 / pp.39-49 / 2019
  • Recently, CPU-GPU integrated heterogeneous multicore processors have been widely used to improve the performance of computing systems. Heterogeneous multicore processors integrate CPUs and GPUs on a single chip, where the CPUs and GPUs share the LLC (Last Level Cache). This causes a serious cache contention problem inside the processor, resulting in significant performance degradation. In this paper, we propose a partitioned LLC architecture to solve the cache contention problem in heterogeneous multicore processors. We analyze the performance impact of varying the LLC sizes assigned to the CPU and the GPU, respectively. According to our simulation results, CPU performance improves by up to 21% as the LLC size assigned to the CPU increases. However, the GPU shows a negligible performance difference when its assigned LLC size increases. In other words, the GPU is unlikely to lose much performance when its LLC size decreases. Because the performance degradation from reducing the GPU's LLC size is much smaller than the performance improvement from increasing the CPU's LLC size, the overall performance of heterogeneous multicore processors is expected to improve when a partitioned LLC is applied to CPUs and GPUs. In addition, if a memory management technique that maximizes the performance of each core is developed in the future, the performance of heterogeneous multicore processors can be greatly improved.

I/O Traffic based Task Classification for Shared Last Level Cache Utilization in NUMA Systems (NUMA 시스템의 공유 LLC 활용을 위한 I/O 트래픽에 따른 태스크 분류법)

  • An, Deukhyeon;Kim, Jihong;Eom, Young Ik
    • Proceedings of the Korea Information Processing Society Conference / 2012.04a / pp.199-201 / 2012
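  • I/O traffic generated by I/O devices such as disks and Ethernet pollutes the shared LLC of a multi-node NUMA system and prevents cache lines from being reused. Tasks that generate such traffic need to be handled separately from memory-intensive tasks that can use the cache efficiently. This paper proposes a technique that monitors and classifies these cache-polluting tasks in real time based on each task's I/O traffic. It also examines the characteristics of tasks that generate large amounts of I/O traffic. This lays the groundwork for studying operating system scheduling techniques that use the shared LLC of each node more efficiently in a NUMA system environment.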

Limiting CPU Frequency Scaling Considering Main Memory Accesses (주메모리 접근을 고려한 CPU 주파수 조정 제한)

  • Park, Moonju
    • KIISE Transactions on Computing Practices / v.20 no.9 / pp.483-491 / 2014
  • Contemporary computer systems exploit DVFS (Dynamic Voltage/Frequency Scaling) technology to balance performance and power consumption. The efficiency of DVFS depends on how much performance is gained for the additional power consumed at an elevated CPU frequency. For memory-bound applications in particular, a higher CPU frequency often does not result in higher performance. In this paper, we present an upper bound on CPU frequency scaling based on memory accesses. Experiments show that the performance gain from a higher CPU frequency is limited by memory accesses (last level cache misses) per instruction. Using these results, we present a CPU frequency upper bound beyond which there is little performance gain. Experimental results show that, for a memory-bound application, applying the frequency upper bound improves the energy efficiency of the application by more than 30%.
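
The intuition that LLC misses per instruction cap the benefit of a higher clock can be illustrated with a simple first-order model. This is a hedged sketch, not the bound derived in the paper: it assumes core cycles scale with frequency while memory stall time does not, and every name and number below (`core_cpi`, `mem_latency_ns`, the 1 GHz and 2 GHz points) is hypothetical.

```python
def time_per_instruction_ns(freq_ghz, core_cpi, misses_per_instr, mem_latency_ns):
    """Assumed model: compute time shrinks with frequency, memory stalls do not."""
    compute_ns = core_cpi / freq_ghz  # cycles per instruction / (cycles per ns)
    stall_ns = misses_per_instr * mem_latency_ns  # independent of CPU frequency
    return compute_ns + stall_ns


def speedup(f_low_ghz, f_high_ghz, core_cpi, misses_per_instr, mem_latency_ns):
    """Speedup obtained by raising the clock from f_low_ghz to f_high_ghz."""
    slow = time_per_instruction_ns(f_low_ghz, core_cpi, misses_per_instr, mem_latency_ns)
    fast = time_per_instruction_ns(f_high_ghz, core_cpi, misses_per_instr, mem_latency_ns)
    return slow / fast


# Hypothetical workloads: one with no LLC misses, one with 0.02 misses per instruction.
print(speedup(1.0, 2.0, core_cpi=1.0, misses_per_instr=0.0, mem_latency_ns=80.0))
print(speedup(1.0, 2.0, core_cpi=1.0, misses_per_instr=0.02, mem_latency_ns=80.0))
```

Under these assumed numbers, doubling the clock doubles performance only when there are no misses; with 0.02 misses per instruction the gain falls below 1.25x, so a frequency cap costs little performance for such a workload.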

Exploiting Hardware Events to Reduce Energy Consumption of HPC Systems

  • Lee, Yongho;Kwon, Osang;Byeon, Kwangeun;Kim, Yongjun;Hong, Seokin
    • Journal of the Korea Society of Computer and Information / v.26 no.8 / pp.1-11 / 2021
  • This paper proposes a novel mechanism called Event-driven Uncore Frequency Scaler (eUFS) to improve the energy efficiency of HPC systems. eUFS exploits hardware events such as LAPI (Last-level Cache Accesses Per Instruction) and CPI (Clock Cycles Per Instruction) to dynamically adjust the uncore frequency. Hardware events are collected over a reference time period, and the target uncore frequency is determined from the collected events and the previous uncore frequency. Experiments demonstrate that eUFS reduces energy consumption by 6% on average for class C and D NPB benchmarks while increasing the execution time by only 2% on average.
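
The two metrics named in this abstract, LAPI and CPI, are simple ratios of hardware counter values, and a controller of this general shape can be sketched as below. The decision rule is an assumption made purely for illustration and is not the eUFS algorithm: the thresholds, step size, and frequency range are hypothetical, as is the function name `next_uncore_freq`.

```python
def next_uncore_freq(llc_accesses, cycles, instructions, prev_freq_ghz,
                     lapi_threshold=0.01, cpi_threshold=1.5,
                     step_ghz=0.1, min_ghz=1.2, max_ghz=2.4):
    """Pick a target uncore frequency from one sampling period's counters.

    All thresholds and limits are hypothetical placeholders, not values
    taken from the paper.
    """
    lapi = llc_accesses / instructions  # LLC accesses per instruction
    cpi = cycles / instructions         # clock cycles per instruction
    if lapi > lapi_threshold and cpi > cpi_threshold:
        target = prev_freq_ghz + step_ghz  # LLC-heavy and slow: raise uncore clock
    else:
        target = prev_freq_ghz - step_ghz  # otherwise: lower it to save energy
    # Clamp to the assumed supported range and round away float noise.
    return round(min(max_ghz, max(min_ghz, target)), 2)


# Hypothetical counter samples for one reference period.
print(next_uncore_freq(5_000_000, 200_000_000, 100_000_000, prev_freq_ghz=2.0))
print(next_uncore_freq(100_000, 80_000_000, 100_000_000, prev_freq_ghz=2.0))
```

The sketch mirrors only the inputs the abstract lists (collected events plus the previous uncore frequency); how the real mechanism maps them to a target frequency is not specified in the abstract.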