• Title/Summary/Keyword: Processor locality

Cost-effective multistage interconnection network for NUMA model system (NUMA(non-uniform memory access) 모델 시스템을 위한 cost-effective한 다단계 상호연결망)

  • 최창훈;김성천
    • Journal of the Korean Institute of Telematics and Electronics C / v.34C no.5 / pp.19-32 / 1997
  • So far, the multiple-path MINs that provide redundant paths in the traditional UPP MINs have been realized by adding extra hardware such as additional stages, duplicated data links, or multiple copies of the MIN. Moreover, traditional MINs do not exploit locality: communication between every processor-memory pair takes the same amount of time, and little progress has been made toward exploiting locality of reference in MINs. In this paper, we present a MIN with a new topology, the hybrid MIN, constructed with 2N-3 switching elements (SEs), far fewer than in traditional MINs. Although it is built from only 2N-3 SEs, the hybrid MIN satisfies full access capability (FAC) and offers redundant paths (while providing a single path to 2 of each processor's memory modules). Moreover, the hybrid MIN provides a shortcut path between pairs that communicate data frequently (locality of reference). Its performance under varying degrees of localized communication is analyzed.

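A quick arithmetic check of the switch-count claim above; a minimal sketch that assumes the "traditional MIN" baseline is an N x N Omega-style network with (N/2)*log2(N) 2x2 switching elements, which is the standard textbook count:

    /* Switch-count comparison: a conventional N x N single-path MIN
     * (e.g., an Omega network) uses (N/2) * log2(N) 2x2 switching
     * elements, while the hybrid MIN above uses 2N - 3. Treating the
     * Omega count as the baseline is this sketch's assumption. */
    #include <stdio.h>

    int main(void) {
        for (int n = 8; n <= 1024; n *= 2) {
            int stages = 0;
            for (int t = n; t > 1; t >>= 1) stages++;   /* log2(n) */
            int omega_ses  = (n / 2) * stages;          /* traditional MIN */
            int hybrid_ses = 2 * n - 3;                 /* hybrid MIN */
            printf("N=%4d  traditional=%5d SEs  hybrid=%4d SEs\n",
                   n, omega_ses, hybrid_ses);
        }
        return 0;
    }

Already at N = 64 the hybrid figure (125 SEs) is well below the Omega count (192 SEs), and the gap widens as N grows.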

Locality-Conscious Nested-Loops Parallelization

  • Parsa, Saeed;Hamzei, Mohammad
    • ETRI Journal / v.36 no.1 / pp.124-133 / 2014
  • To speed up data-intensive programs, two complementary techniques, namely nested-loop parallelization and data locality optimization, should be considered. Effective parallelization techniques distribute the computation and necessary data across different processors, whereas data locality optimization places data on the same processor as the computations that use them. Therefore, locality and parallelization may demand different loop transformations, and an integrated approach that combines the two can generate much better results than either individual approach. This paper proposes a unified approach that integrates these two techniques to obtain an appropriate loop transformation. Applying this transformation results in coarse-grain parallelism, through exploiting the largest possible groups of outer permutable loops, in addition to data locality, through dependence satisfaction at inner loops. These groups can be further tiled to improve data locality by exploiting data reuse in multiple dimensions.
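
As an illustration of the kind of transformation such an integrated approach produces, a minimal generic sketch (textbook loop tiling of a matrix-multiply nest, not the authors' algorithm; TILE is an assumed tuning parameter):

    /* Loop tiling for locality plus coarse-grain parallelism: the two
     * outer tile loops carry no dependences and could be distributed
     * across processors, while the inner loops reuse the tiles held
     * in cache. A generic textbook transformation, not the paper's
     * exact method. */
    #define N    1024
    #define TILE 64

    void matmul_tiled(const double A[N][N], const double B[N][N],
                      double C[N][N]) {
        for (int ii = 0; ii < N; ii += TILE)          /* parallelizable */
            for (int jj = 0; jj < N; jj += TILE)      /* parallelizable */
                for (int kk = 0; kk < N; kk += TILE)  /* reuse dimension */
                    for (int i = ii; i < ii + TILE; i++)
                        for (int j = jj; j < jj + TILE; j++) {
                            double sum = C[i][j];
                            for (int k = kk; k < kk + TILE; k++)
                                sum += A[i][k] * B[k][j];
                            C[i][j] = sum;
                        }
    }

The ii and jj loops are permutable and independent, which gives the coarse-grain parallelism, while tiling all three dimensions keeps each tile's working set small enough to stay in cache.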

Remote Cache Replacement Policy using Processor Locality in Multi-Processor System (다중 프로세서 시스템에서 프로세서 지역성을 이용한 원격 캐쉬 교체 정책)

  • Han Sang Yoon;Kwak Jong Wook;Jhang Seong Tae;Jhon Chu Shik
    • Journal of KIISE: Computer Systems and Theory / v.32 no.11_12 / pp.541-556 / 2005
  • Memory access latency has been a primary factor of performance degradation in both single-processor and multi-processor systems. Remote memory accesses incur far more overhead than local memory accesses, especially in distributed shared-memory systems. To address this problem, multi-level cache architectures that add a remote cache to the multi-processor system have been proposed. In this paper, we propose a new cache replacement policy that improves the performance of a multi-processor system with a remote cache. If the multi-level cache maintains the multi-level inclusion (MLI) property and uses the LRU (Least Recently Used) replacement policy, the LRU information of the higher-level cache (a processor cache) can differ from that of the lower-level cache (a remote cache). In this situation, replacing a remote cache line can force the eviction of a processor cache line that is still in use by the processor, which is a major source of performance degradation in the whole system. To alleviate this disadvantage of LRU replacement, the new policy analyzes each node's remote memory access pattern and uses this information to reduce the number of invalidations of useful cache lines in the higher-level cache. The new remote cache replacement policy improves performance by up to 3.5% and by 2.5% on average on the SPLASH-2 benchmarks, compared to the conventional LRU replacement policy.
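
To make the inclusion problem concrete, a minimal sketch of one plausible remedy (an illustration only, not the paper's algorithm; in_processor_cache() is a hypothetical helper standing in for whatever presence information the hardware tracks): when choosing a victim in the remote cache, prefer lines that are no longer resident in the processor cache:

    /* Victim selection for an inclusive remote cache: scan the set in
     * LRU order, but skip lines still present in the higher-level
     * processor cache when another candidate exists, so inclusion-driven
     * invalidations of actively used lines are avoided. */
    #include <stdbool.h>

    #define WAYS 8

    typedef struct {
        unsigned long tag;
        bool          valid;
        unsigned      lru_rank;   /* 0 = most recent, WAYS-1 = least */
    } line_t;

    extern bool in_processor_cache(unsigned long tag);  /* assumed helper */

    int choose_victim(line_t set[WAYS]) {
        int fallback = -1;
        /* Walk ways from least to most recently used. */
        for (unsigned rank = WAYS; rank-- > 0; ) {
            for (int w = 0; w < WAYS; w++) {
                if (!set[w].valid) return w;            /* free way */
                if (set[w].lru_rank != rank) continue;
                if (fallback < 0) fallback = w;         /* plain LRU choice */
                if (!in_processor_cache(set[w].tag))
                    return w;   /* evicting this invalidates nothing above */
            }
        }
        return fallback;  /* every line is cached above: fall back to LRU */
    }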

Cache Replacement Policy Based on Dynamic Counter for High Performance Processor (고성능 프로세서를 위한 카운터 기반의 캐시 교체 알고리즘)

  • Jung, Do Young;Lee, Yong Surk
    • Journal of the Institute of Electronics and Information Engineers / v.50 no.4 / pp.52-58 / 2013
  • The replacement policy is one of the key factors determining the effectiveness of a cache. The LRU replacement policy has remained the standard for caches for many years. However, traditional LRU performs poorly on workloads dominated by zero-reuse lines, although it performs well on workloads with high temporal locality. To address this problem, we propose a new replacement policy: the DCR (Dynamic Counter based Replacement) policy. The temporal locality of a workload changes dynamically over time, and the DCR policy is based on detecting these changes. The DCR policy reduces the cache miss rate relative to a traditional LRU policy by up to 2.7% and by 0.47% on average.
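
The abstract does not spell out the DCR mechanism, so the following is only a generic sketch of the counter-based family it belongs to (all names and the 2-bit counter width are assumptions): each line keeps a small saturating reuse counter, the line with the lowest count is evicted first, and a periodic decay lets the policy track changes in temporal locality:

    /* Generic reuse-counter replacement: each line keeps a 2-bit
     * saturating counter that is bumped on hits and halved on a
     * periodic decay. Zero-reuse lines keep a low count and leave the
     * cache quickly. A sketch of the counter-based family of policies,
     * not the paper's exact DCR mechanism. */
    #define WAYS    4
    #define CTR_MAX 3

    typedef struct {
        unsigned long tag;
        unsigned      ctr;   /* saturating reuse counter */
    } cline_t;

    void on_hit(cline_t *l) {
        if (l->ctr < CTR_MAX) l->ctr++;
    }

    int pick_victim(cline_t set[WAYS]) {
        int victim = 0;
        for (int w = 1; w < WAYS; w++)
            if (set[w].ctr < set[victim].ctr) victim = w;
        return victim;  /* lowest reuse count leaves first */
    }

    void decay(cline_t set[WAYS]) {
        /* Called every K accesses (K is a tuning knob) so stale
         * history fades as the workload's locality changes. */
        for (int w = 0; w < WAYS; w++) set[w].ctr >>= 1;
    }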

Memory Latency Hiding Techniques (메모리 지연을 감추는 기법들)

  • Ki, An-Do
    • Electronics and Telecommunications Trends / v.13 no.3 s.51 / pp.61-70 / 1998
  • The obvious way to make a computer system more powerful is to make the processor as fast as possible, and adopting a large number of such fast processors would be the next step. Such a multiprocessor system is useful only if it distributes the workload uniformly and if its processors are fully utilized. To achieve higher processor utilization, memory access latency must be reduced as much as possible, and the remaining latency must be hidden. The actual latency can be reduced by using fast logic, and the effective latency can be reduced by using caches. This article discusses what the memory latency problem is, shows how serious it is with analytical and simulation results, and surveys existing techniques for coping with it, such as write buffers, relaxed consistency models, multithreading, data locality optimization, data forwarding, and data prefetching.
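
Of the techniques listed, data prefetching is the easiest to show in a few lines of code; a minimal sketch using the GCC/Clang builtin __builtin_prefetch (the prefetch distance of 16 elements is an assumed tuning value):

    /* Software data prefetching: request array elements a fixed
     * distance ahead so the loads overlap with computation instead of
     * stalling on a cache miss. __builtin_prefetch is a GCC/Clang
     * builtin; PFDIST must be tuned to the machine's miss latency. */
    #define PFDIST 16

    double sum_prefetched(const double *a, long n) {
        double s = 0.0;
        for (long i = 0; i < n; i++) {
            if (i + PFDIST < n)
                __builtin_prefetch(&a[i + PFDIST], /*rw=*/0, /*locality=*/1);
            s += a[i];
        }
        return s;
    }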

Reliability Improvement of the Tag Bits of the Cache Memory against the Soft Errors (소프트 에러에 대한 캐쉬 메모리의 태그 비트 신뢰성 향상 기법)

  • Kim, Young-Ung
    • The Journal of the Institute of Internet, Broadcasting and Communication / v.14 no.1 / pp.15-21 / 2014
  • Due to manufacturing technology scaling, more transistors can be placed on the cache memories of a processor. However, highly integrated transistors make processors more vulnerable to soft errors, so the reliability of the cache memory must be considered seriously at the design level. Various studies have proposed ways to overcome this vulnerability to soft errors, but studies on the tag bits are rare. In this paper, we re-evaluate a reliability improvement technique for the tag bits and analyze the protection rate of the write-back operation, a typical case in which temporal locality is not satisfied. We also propose a methodology to improve the protection rate of the write-back operation. Experiments with the proposed scheme show a protection rate of up to 76.8% without performance degradation.
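
For background on what protecting tag bits involves, a minimal parity-based sketch (a generic single-error-detection technique chosen here for illustration; the paper's actual mechanism is not described in the abstract):

    /* Parity protection of a cache tag: one extra bit detects any
     * single-bit soft error in the stored tag. A mismatch detected on
     * a dirty (write-back) line is the dangerous case the abstract
     * discusses, since the line cannot simply be refetched. */
    #include <stdbool.h>
    #include <stdint.h>

    static unsigned parity(uint32_t v) {
        v ^= v >> 16; v ^= v >> 8; v ^= v >> 4; v ^= v >> 2; v ^= v >> 1;
        return v & 1u;
    }

    typedef struct {
        uint32_t tag;
        unsigned tag_parity : 1;
        bool     dirty;
    } tagged_line_t;

    void store_tag(tagged_line_t *l, uint32_t tag) {
        l->tag = tag;
        l->tag_parity = parity(tag);   /* computed once on fill */
    }

    bool tag_intact(const tagged_line_t *l) {
        /* false => a soft error flipped some tag bit since the fill */
        return parity(l->tag) == l->tag_parity;
    }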

Low-Power Cache Design by using Locality Buffer and Address Compression (지역 버퍼와 주소 압축을 통한 저전력 캐시 설계)

  • Kwak, Jong Wook
    • Journal of the Korea Society of Computer and Information / v.18 no.9 / pp.11-19 / 2013
  • Most modern computer systems employ caches to alleviate the access-time gap between the processor and the memory system. The power dissipated by the cache system has become a significant part of the total power dissipated by the microprocessor chip, so power reduction in the cache system is an important issue. The partial tag cache is a scheme aimed at low power consumption; its main power reduction comes from matching small partial tags rather than full tags. In this paper, we first analyze previous partial tag cache systems and then propose a new address matching mechanism that uses a locality buffer and address compression. In our simulations, the proposed model shows an average power reduction of 18% over a regular cache while providing the same level of performance.
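
A minimal sketch of the partial tag matching idea itself (generic; the locality buffer and address compression of the proposed mechanism are only named in the abstract, and PTAG_BITS is an assumed parameter): compare a narrow slice of the tag first, and pay for the full-width comparison only when the slice matches:

    /* Partial tag matching: a small k-bit slice of the tag filters out
     * most mismatching ways, so the wide full-tag comparators (and
     * their power cost) are exercised only on likely hits. */
    #include <stdbool.h>
    #include <stdint.h>

    #define WAYS      4
    #define PTAG_BITS 6
    #define PTAG_MASK ((1u << PTAG_BITS) - 1u)

    typedef struct {
        uint32_t full_tag;
        uint8_t  partial_tag;   /* low PTAG_BITS of full_tag */
        bool     valid;
    } ptag_line_t;

    int lookup(const ptag_line_t set[WAYS], uint32_t tag) {
        uint8_t p = tag & PTAG_MASK;
        for (int w = 0; w < WAYS; w++) {
            if (!set[w].valid || set[w].partial_tag != p)
                continue;                      /* cheap narrow compare */
            if (set[w].full_tag == tag)
                return w;                      /* expensive compare, rare */
        }
        return -1;  /* miss */
    }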

A Low-Complexity Real-Time Barrel Distortion Correction Processor Combined with Color Demosaicking (컬러 디모자이킹이 결합된 저 복잡도의 실시간 배럴 왜곡 보정 프로세서)

  • Jeong, Hui-Seong;Park, Yun-Ju;Kim, Tae-Hwan
    • Journal of the Institute of Electronics and Information Engineers / v.51 no.9 / pp.57-66 / 2014
  • This paper presents a low-complexity barrel distortion correction processor for wide-angle cameras. The proposed processor performs the barrel distortion correction jointly with the color demosaicking, so that the hardware complexity can be reduced significantly. In addition, to reduce the required memory bandwidth, an efficient memory interface is proposed by utilizing the spatial locality of the memory accesses in the correction process. The proposed processor is implemented with 35K logic gates in a 0.11-µm CMOS process, and its correction speed is 150 Mpixels/s at the operating frequency of 606 MHz, where the supported frame size is 2048×2048 and the required memory bandwidth is 1 read/cycle.
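
For reference, the radial model that underlies barrel distortion correction, in a minimal sketch (a generic one-coefficient model with assumed values; the processor's actual mapping, interpolation, and demosaicking pipeline are not given in the abstract):

    /* Inverse mapping for barrel distortion: for each corrected output
     * pixel, find the source coordinate in the distorted image using a
     * one-coefficient radial model r_d = r_u * (1 + k * r_u^2). The
     * coefficient k and the coordinate normalization are assumptions. */
    typedef struct { double x, y; } pt_t;

    pt_t distorted_coord(double xu, double yu,   /* undistorted, centered */
                         double k)               /* e.g., k = -0.15 */
    {
        double r2 = xu * xu + yu * yu;
        double scale = 1.0 + k * r2;             /* radial distortion factor */
        pt_t p = { xu * scale, yu * scale };
        return p;   /* sample the input image here (with interpolation) */
    }

Because neighboring output pixels map to neighboring source pixels, the sampling positions exhibit the spatial locality that the proposed memory interface exploits.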

Design of A Media Processor Equipped with Dual Cache (복수 캐시로 구성한 미디어 프로세서의 설계)

  • Moon, Hyun-Ju;Jeon, Joong-Nam;Kim, Suk-Il
    • Journal of KIISE: Computer Systems and Theory / v.29 no.10 / pp.573-581 / 2002
  • In this paper, we propose a media processor with a dual-cache architecture, composed of a multimedia data cache and a general-purpose data cache, to prevent the performance degradation caused by memory delay. In the proposed processor architecture, multimedia data referenced by subword instructions are loaded into the multimedia data cache, and the remaining data are loaded into the general-purpose data cache. Also, we use a multi-block prefetching scheme that fetches two consecutive data blocks into the cache at a time to exploit the locality of multimedia data. Experimental results on MPEG and JPEG benchmark programs show that the proposed processor architecture yields better performance than a processor equipped with a single data cache.
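
A minimal sketch of the dispatch and prefetch rules described above (cache_t and its functions are hypothetical stand-ins, since the abstract does not give an interface): subword accesses go to the media cache and pull in two consecutive blocks on a miss, while all other accesses use the general-purpose cache:

    /* Dual-cache dispatch: subword (packed multimedia) accesses are
     * served by the media cache with two-block prefetch; all other
     * accesses use the general-purpose cache. */
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct cache cache_t;
    extern cache_t media_cache, general_cache;
    extern bool    cache_access(cache_t *c, uint64_t addr);  /* true = hit */
    extern void    cache_fill(cache_t *c, uint64_t block_addr);

    #define BLOCK 64u

    void load(uint64_t addr, bool is_subword_insn) {
        cache_t *c = is_subword_insn ? &media_cache : &general_cache;
        if (cache_access(c, addr)) return;             /* hit */
        uint64_t blk = addr & ~(uint64_t)(BLOCK - 1);
        cache_fill(c, blk);                            /* demanded block */
        if (is_subword_insn)
            cache_fill(c, blk + BLOCK);  /* multi-block prefetch: next block */
    }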

A Parallel Loop Scheduling Algorithm on Multiprocessor System Environments (다중프로세서 시스템 환경에서 병렬 루프 스케쥴링 알고리즘)

  • 이영규;박두순
    • Journal of Korea Multimedia Society / v.3 no.3 / pp.309-319 / 2000
  • The purpose of parallel loop scheduling in a multiprocessor environment is to carry out the scheduling with minimum synchronization overhead and to balance the load of a parallel application program. Processors compute a chunk of iterations and are assigned to execute those iterations in parallel; because this involves frequent accesses to mutually exclusive global memory, considerable scheduling overhead and a bottleneck are imposed. In addition, when the work is distributed unevenly across the chunks allocated to the processors, the differing execution times of the chunks cause load imbalance and hurt the performance of the whole schedule. In this paper, we examine the problems of conventional algorithms with respect to minimizing scheduling overhead and balancing load, and we then present a new parallel loop scheduling algorithm that takes data locality and processor affinity into account.

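For context on the chunk computation and the shared-counter bottleneck discussed above, a minimal sketch of guided self-scheduling, a classic scheme the abstract's critique applies to (this is not the authors' new algorithm; P and the helper function are assumptions):

    /* Guided self-scheduling: each processor repeatedly claims
     * chunk = ceil(remaining / P) iterations from a shared counter.
     * The single atomic counter is exactly the kind of mutually
     * exclusive global state whose contention the abstract criticizes. */
    #include <stdatomic.h>

    #define P 8                      /* number of processors */

    atomic_long next_iter;           /* shared loop index */
    long        total_iters;

    extern void run_iteration(long i);

    void worker(void) {
        for (;;) {
            long start = atomic_load(&next_iter);
            long remaining = total_iters - start;
            if (remaining <= 0) break;
            long chunk = (remaining + P - 1) / P;    /* ceil(remaining/P) */
            /* Claim [start, start+chunk) if no one raced us to it. */
            if (!atomic_compare_exchange_weak(&next_iter, &start,
                                              start + chunk))
                continue;                            /* lost the race: retry */
            for (long i = start; i < start + chunk; i++)
                run_iteration(i);
        }
    }

Affinity-aware schedulers like the one the paper proposes reduce the pressure on this shared counter and try to hand each processor chunks whose data it has touched before.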