• Title/Summary/Keyword: Benchmarks


Compact Field Remapping for Dynamically Allocated Structures (동적으로 할당된 구조체를 위한 압축된 필드 재배치)

  • Kim, Jeong-Eun; Han, Hwan-Soo
    • Journal of KIISE: Software and Applications, v.32 no.10, pp.1003-1012, 2005
  • Embedded systems differ from general-purpose systems most significantly in that they may use only limited resources, including battery and memory. In particular, the number of applications that deal with multimedia data is increasing. In such computation-intensive systems, memory access latency is one of the major bottlenecks hurting system performance, and many researchers have therefore investigated techniques to reduce the memory access cost. Most programs exhibit locality in their memory references. Temporal locality means that a resource accessed at one point will be used again in the near future; spatial locality means that a resource is more likely to be used if resources near it were just accessed. The latest embedded processors usually adopt cache memory to exploit these two types of locality, since accessing cache memory is faster than accessing off-chip memory and reduces latency. In this paper, we propose an enhanced dynamic allocation technique for structure-type data that eliminates unused memory space and reduces both the cache miss rate and the application execution time. The proposed approach aggregates fields from multiple dynamically allocated records and remaps them consecutively onto the memory space. Experiments on the Olden benchmarks show a 13.9% drop in L1 cache miss rate and a 15.9% drop in L2 cache miss rate on average, compared to previously proposed techniques. We also find execution time reduced by 10.9% on average, compared to the previous work.
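
A minimal C sketch of the underlying idea, with the pool layout and names as illustrative assumptions rather than the paper's implementation: instead of allocating each record individually, hot fields from many records are packed into contiguous arrays, so a traversal that touches only the hot fields stays dense in the cache.

```c
#include <stdio.h>
#include <stdlib.h>

/* Conventional layout: every record carries all of its fields, so a
 * traversal that touches only `key` and `next` still drags the cold
 * payload into the cache. */
struct node {
    int key;
    struct node *next;
    char payload[48];           /* cold field */
};

/* Field-remapped layout (hypothetical): hot fields from many records
 * are packed into contiguous arrays; cold fields live apart. */
#define POOL 1024
struct hot_pool {
    int  key[POOL];             /* hot fields, densely packed      */
    int  next[POOL];            /* index of the next record, -1=end */
    char payload[POOL][48];     /* cold fields, out of the way     */
    int  used;
};

static int pool_alloc(struct hot_pool *p) {
    return p->used < POOL ? p->used++ : -1;  /* -1 when pool is full */
}

int main(void) {
    struct hot_pool *p = calloc(1, sizeof *p);
    int head = -1;
    for (int i = 0; i < 10; i++) {   /* build a small linked list */
        int n = pool_alloc(p);
        p->key[n] = i;
        p->next[n] = head;
        head = n;
    }
    long sum = 0;                    /* hot-field-only traversal  */
    for (int n = head; n != -1; n = p->next[n])
        sum += p->key[n];
    printf("sum = %ld\n", sum);      /* prints 45 */
    free(p);
    return 0;
}
```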

Remote Cache Replacement Policy using Processor Locality in Multi-Processor System (다중 프로세서 시스템에서 프로세서 지역성을 이용한 원격 캐쉬 교체 정책)

  • Han, Sang-Yoon; Kwak, Jong-Wook; Jhang, Seong-Tae; Jhon, Chu-Shik
    • Journal of KIISE: Computer Systems and Theory, v.32 no.11/12, pp.541-556, 2005
  • Memory access latency has been a primary factor of performance degradation in both single-processor and multi-processor systems. Remote memory accesses incur much more overhead than local memory accesses, especially in distributed shared-memory systems. To mitigate this problem, a multi-level cache architecture that adds a remote cache to the multi-processor system has been proposed. In this paper, we propose a new cache replacement policy that improves the performance of a multi-processor system with such a remote cache. If the multi-level cache keeps the multi-level inclusion (MLI) property and uses the LRU (Least Recently Used) replacement policy, the LRU information of the higher-level cache (a processor cache) can differ from that of the lower-level cache (a remote cache). In this situation, replacing a remote cache line can force the eviction of a processor cache line that is still in use by the processor, which is a major source of performance degradation in the whole system. To alleviate this disadvantage of LRU replacement, the new policy analyzes each node's remote memory access pattern and uses this information to reduce the number of invalidations of useful cache lines in the higher-level cache. The new remote cache replacement policy improves performance by up to 3.5%, and by 2.5% on average, on the SPLASH-2 benchmarks, compared to the standard LRU replacement policy.
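
A hedged sketch of the core decision in C: when evicting from the remote cache, prefer the least recently used line that is not also resident in the processor cache, falling back to plain LRU. The `in_l1` bit and the data layout are assumptions for illustration; the paper's policy additionally weighs each node's remote access pattern.

```c
#include <stdbool.h>
#include <stdio.h>

#define WAYS 4

/* One remote-cache line; a smaller lru value means less recently used. */
struct line { unsigned tag; unsigned lru; bool in_l1; bool valid; };

/* Pick an LRU victim, preferring lines absent from the processor cache,
 * so that a remote-cache eviction does not invalidate a line the
 * processor still uses (required by multi-level inclusion). */
static int pick_victim(const struct line set[WAYS]) {
    int victim = -1, fallback = -1;
    unsigned best = ~0u, best_fb = ~0u;
    for (int w = 0; w < WAYS; w++) {
        if (!set[w].valid) return w;                 /* free way first  */
        if (!set[w].in_l1 && set[w].lru < best) {    /* prefer non-L1   */
            best = set[w].lru; victim = w;
        }
        if (set[w].lru < best_fb) {                  /* plain LRU       */
            best_fb = set[w].lru; fallback = w;
        }
    }
    return victim >= 0 ? victim : fallback;
}

int main(void) {
    struct line set[WAYS] = {
        {10, 0, true,  true},   /* oldest, but still in the processor cache */
        {11, 1, false, true},   /* oldest line NOT in the processor cache   */
        {12, 2, true,  true},
        {13, 3, false, true},
    };
    printf("victim way: %d\n", pick_victim(set));  /* prints 1, not 0 */
    return 0;
}
```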

A Study on the Prediction Accuracy Bounds of Instruction Prefetching (명령어 선인출 예측 정확도의 한계에 관한 연구)

  • Kim, Seong-Baeg; Min, Sang-Lyul; Kim, Chong-Sang
    • Journal of KIISE: Computer Systems and Theory, v.27 no.8, pp.719-729, 2000
  • Prefetching aims at reducing memory latency by fetching, in advance, data that are likely to be requested by the processor in the near future. The effectiveness of prefetching is determined by how accurately the needed instructions and data are predicted. Most previous studies on prefetching were limited to proposing a particular prefetch scheme and evaluating its performance, paying little attention to the theoretical aspects of prefetching. This paper focuses on the theoretical aspects of instruction prefetching. For this purpose, we propose a clairvoyant prefetch model that makes use of perfect history information. Based on this theoretical model, we analyzed upper limits on the prefetch prediction accuracies of the SPEC benchmarks. The results show that the prefetch prediction accuracy is very high when there is no cache. However, as the size of the instruction cache increases, the prefetch prediction accuracy drops drastically. For example, in the case of the spice benchmark, the prefetch prediction accuracy drops from 53% to 39% when the cache size increases from 2 Kbytes to 16 Kbytes (assuming a 16-byte block size). These results indicate that as the cache size increases, most localities are captured by the cache, and that instruction prefetching based on information extracted from the references that missed in the cache suffers from prediction inaccuracies.
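
The flavor of such an upper-bound analysis can be sketched with a toy model in C: given the full miss trace in advance, the best any per-context predictor could do is always guess that context's most frequent successor. This first-order simplification is an assumption made for illustration, not the paper's exact clairvoyant model.

```c
#include <stdio.h>

#define NBLK 8  /* toy address space of 8 cache blocks */

int main(void) {
    /* A toy trace of missed block numbers, known in full (clairvoyance). */
    int trace[] = {0,1,2,0,1,3,0,1,2,0,1,3,0,1,2};
    int n = sizeof trace / sizeof trace[0];

    /* Count successor frequencies for each block. */
    static int cnt[NBLK][NBLK];
    for (int i = 0; i + 1 < n; i++)
        cnt[trace[i]][trace[i+1]]++;

    /* Best fixed choice per context = upper bound on correct predictions. */
    int correct = 0;
    for (int b = 0; b < NBLK; b++) {
        int best = 0;
        for (int s = 0; s < NBLK; s++)
            if (cnt[b][s] > best) best = cnt[b][s];
        correct += best;
    }
    printf("prediction accuracy upper bound = %.1f%%\n",
           100.0 * correct / (n - 1));  /* 85.7%% for this toy trace */
    return 0;
}
```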


MOC: A Multiple-Object Clustering Scheme for High Performance of Page-out in BSD VM (MOC: 다중 오브젝트 클러스터링을 통한 BSD VM의 페이지-아웃 성능 향상)

  • Yang, Jong-Cheol; Ahn, Woo-Hyun; Oh, Jae-Won
    • Journal of KIISE: Computer Systems and Theory, v.36 no.6, pp.476-487, 2009
  • The virtual memory system in 4.4BSD operating systems exploits a clustering scheme to reduce the disk I/Os incurred in paging out (or flushing) modified pages that are about to be replaced in order to free memory. Upon the page-out of a victim page, the scheme writes a cluster (or group) of modified pages that are contiguous with the victim in the virtual address space to the swap disk in a single disk write. However, it fails to find large clusters of contiguous pages if applications modify pages that are not adjacent to each other in the virtual address space. To address this problem, we propose a new clustering scheme called Multiple-Object Clustering (MOC), which stores multiple clusters in the virtual address space with a single disk write instead of paging out the clusters in separate disk I/Os. This multiple-cluster transfer allows the virtual memory system to significantly decrease disk writes, thus improving page-out performance. Our experiments on FreeBSD 6.2 show that MOC improves the execution times of realistic benchmarks such as NS2, Scimark2 SOR, and nbench LU by 9 to 45% over the traditional clustering scheme.
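
A minimal C sketch of the gathering step, under the assumption that dirty pages are tracked in a bitmap: multiple runs (clusters) of dirty pages are collected so that one disk write can cover all of them, where the traditional scheme would issue one I/O per run. The names and limits are illustrative, not the FreeBSD implementation.

```c
#include <stdbool.h>
#include <stdio.h>

#define NPAGES 32   /* toy address-space size in pages   */
#define MAXSEG 8    /* max clusters batched per page-out */

struct seg { int start, len; };

/* Gather every run of contiguous dirty pages so that they can all be
 * flushed with a single disk write instead of one I/O per run. */
static int gather_clusters(const bool dirty[NPAGES], struct seg out[MAXSEG]) {
    int nseg = 0;
    for (int i = 0; i < NPAGES && nseg < MAXSEG; ) {
        if (!dirty[i]) { i++; continue; }
        int start = i;
        while (i < NPAGES && dirty[i]) i++;          /* extend the run */
        out[nseg++] = (struct seg){ start, i - start };
    }
    return nseg;
}

int main(void) {
    bool dirty[NPAGES] = {0};
    dirty[2] = dirty[3] = dirty[10] = dirty[11] = dirty[12] = dirty[20] = true;
    struct seg segs[MAXSEG];
    int n = gather_clusters(dirty, segs);
    printf("one page-out I/O covering %d clusters:\n", n);
    for (int s = 0; s < n; s++)
        printf("  pages %d..%d\n", segs[s].start,
               segs[s].start + segs[s].len - 1);
    return 0;
}
```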

APC: An Adaptive Page Prefetching Control Scheme in Virtual Memory System (APC: 가상 메모리 시스템에서 적응적 페이지 선반입 제어 기법)

  • Ahn, Woo-Hyun; Yang, Jong-Cheol; Oh, Jae-Won
    • Journal of KIISE: Computer Systems and Theory, v.37 no.3, pp.172-183, 2010
  • Virtual memory systems (VM) reduce the disk I/Os caused by page faults through page prefetching, which reads additional pages together with the desired page in a single disk I/O at a page fault. Operating systems including 4.4BSD attempt to prefetch as many pages as possible at a page fault regardless of the page access patterns of applications. However, such an approach increases the disk access time taken to service a page fault when a high portion of the prefetched pages is never referenced. More seriously, the approach can cause memory pollution, a problem in which prefetched pages that are never accessed evict other pages that will be referenced soon. To solve these problems, we propose an adaptive page prefetching control scheme (APC), which periodically monitors the access patterns of prefetched pages on a per-process basis. Such a pattern is represented as the ratio of referenced pages among prefetched ones before they are evicted from memory. APC then uses this ratio to adjust the number of pages that 4.4BSD VM intends to prefetch at a page fault. Thus APC allows 4.4BSD VM to prefetch a proper number of pages and better reduce disk I/Os, even when the page access patterns of an application vary at runtime. Experiments with our technique implemented in FreeBSD 6.2 show that APC improves the execution times of the SOR, SMM, and FFT benchmarks over 4.4BSD VM by up to 57%.
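
The control loop can be sketched in C as a simple feedback rule; the thresholds, window bounds, and multiplicative steps are illustrative assumptions, not the paper's exact policy. The VM periodically measures the fraction of prefetched pages referenced before eviction and widens or narrows the per-process prefetch window accordingly.

```c
#include <stdio.h>

#define MIN_PREFETCH 1
#define MAX_PREFETCH 32

/* Adjust the per-process prefetch window from the fraction of prefetched
 * pages that were referenced before eviction. The 0.75/0.25 thresholds
 * and doubling/halving steps are hypothetical. */
static int adjust_prefetch(int window, int prefetched, int referenced) {
    if (prefetched == 0) return window;      /* nothing measured yet */
    double ratio = (double)referenced / prefetched;
    if (ratio > 0.75) {                      /* mostly useful: widen */
        int wider = window * 2;
        return wider < MAX_PREFETCH ? wider : MAX_PREFETCH;
    }
    if (ratio < 0.25) {                      /* mostly wasted: narrow */
        int narrower = window / 2;
        return narrower > MIN_PREFETCH ? narrower : MIN_PREFETCH;
    }
    return window;
}

int main(void) {
    int w = 8;
    w = adjust_prefetch(w, 64, 60);   /* 94% referenced: widen  */
    printf("window after useful period:   %d\n", w);   /* 16 */
    w = adjust_prefetch(w, 64, 4);    /*  6% referenced: narrow */
    printf("window after wasteful period: %d\n", w);   /*  8 */
    return 0;
}
```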

Design and Implementation of an In-Memory File System Cache with Selective Compression (대용량 파일시스템을 위한 선택적 압축을 지원하는 인-메모리 캐시의 설계와 구현)

  • Choe, Hyeongwon; Seo, Euiseong
    • Journal of KIISE, v.44 no.7, pp.658-667, 2017
  • The demand for large-scale storage systems has continued to grow due to the emergence of multimedia, social-network, and big-data services. In order to improve the response time and reduce the load of such large-scale storage systems, DRAM-based in-memory cache systems are becoming popular. However, the high cost of DRAM severely restricts their capacity. While compressing cache entries has been proposed to deal with this capacity limitation, compression and decompression, which are technically difficult to parallelize, incur significant processing overhead and in turn retard the response time. This paper proposes a selective compression scheme for in-memory file system caches that rapidly estimates the compressibility of incoming cache entries from their Shannon entropies and compresses only the entries expected to compress well. In addition, we describe the design and implementation of an in-kernel in-memory file system cache with the proposed selective compression scheme. The evaluation showed that the proposed scheme reduced the execution time of benchmarks by approximately 18% in comparison to the conventional non-compressing in-memory cache scheme. It also provided a cache hit ratio similar to that of the all-compressing counterpart while reducing execution time by 7.5% through lower compression overhead. In addition, the selective compression scheme reduced the CPU time used for compression by 28% compared to the all-compressing scheme.
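
The screening step can be sketched in C: the byte-level Shannon entropy of an incoming entry is estimated from a histogram, and only entries below some threshold (a hypothetical 6 bits/byte here; the paper's cutoff may differ) would be handed to the compressor. Compile with -lm for log2.

```c
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Byte-level Shannon entropy in bits per byte: H = -sum(p*log2(p)).
 * Near 8 means the data looks random and is unlikely to compress. */
static double shannon_entropy(const unsigned char *buf, size_t len) {
    size_t hist[256] = {0};
    for (size_t i = 0; i < len; i++) hist[buf[i]]++;
    double h = 0.0;
    for (int b = 0; b < 256; b++) {
        if (!hist[b]) continue;
        double p = (double)hist[b] / len;
        h -= p * log2(p);
    }
    return h;
}

int main(void) {
    const double THRESHOLD = 6.0;           /* hypothetical cutoff */
    unsigned char buf[4096];

    memset(buf, 'A', sizeof buf);           /* highly compressible */
    printf("uniform buffer: %.2f bits/byte, compress = %d\n",
           shannon_entropy(buf, sizeof buf),
           shannon_entropy(buf, sizeof buf) < THRESHOLD);

    srand(42);                              /* random-looking: skip */
    for (size_t i = 0; i < sizeof buf; i++)
        buf[i] = (unsigned char)(rand() & 0xFF);
    printf("random buffer:  %.2f bits/byte, compress = %d\n",
           shannon_entropy(buf, sizeof buf),
           shannon_entropy(buf, sizeof buf) < THRESHOLD);
    return 0;
}
```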

Lightweight Loop Invariant Code Motion for Java Just-In-Time Compiler on Itanium (Itanium상의 자바 적시 컴파일러를 위한 가벼운 루프 불변 코드 이동)

  • Yu, Jun-Min; Choi, Hyung-Kyu; Moon, Soo-Mook
    • Journal of KIISE: Software and Applications, v.32 no.3, pp.215-226, 2005
  • Loop-invariant code motion (LICM) optimization involves relatively heavy code analyses and is thus not readily applicable to Java Just-In-Time (JIT) compilation, where the JIT compilation time is part of the whole running time. 'Classical' LICM optimization first analyzes the code and constructs both def-use chains and use-def chains, which are then used for performing code motions. This paper proposes a lightweight LICM algorithm that requires only the def-use chains of loop-invariant code (without use-def chains), exploiting the fact that the Java virtual machine is based on a stack machine and hence generates code with simpler patterns. We also propose two techniques that allow more code motion than classical LICM techniques. First, unlike previous JIT techniques that apply LICM only to single-path loops for simplicity, we apply LICM safely to partially redundant code in multi-path loops (natural loops). Second, we move loop-invariant, partially redundant null-pointer-check code via the predication support on Itanium. The proposed techniques were implemented in a JIT compiler for the Itanium processor on Intel's ORP (Open Runtime Platform) Java virtual machine. On the SPECjvm98 benchmarks, the proposed technique increases the JIT compilation overhead by a geometric mean of 1.3%, yet improves the total running time by a geometric mean of 2.2%.
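
For readers unfamiliar with LICM, the transformation itself looks like this generic C example (the paper's algorithm works on Java bytecode inside a JIT, not on C source): an expression whose operands do not change inside the loop is computed once, above the loop.

```c
#include <stdio.h>
#include <string.h>

int main(void) {
    const char *s = "hello, licm";
    long t1 = 0, t2 = 0;

    /* Before LICM: strlen(s) is loop-invariant, yet it is
     * re-evaluated on every iteration. */
    for (size_t i = 0; i < strlen(s); i++)
        t1 += s[i];

    /* After LICM: the invariant computation is hoisted above
     * the loop, so it executes exactly once. */
    size_t len = strlen(s);
    for (size_t i = 0; i < len; i++)
        t2 += s[i];

    printf("%ld %ld\n", t1, t2);   /* identical results */
    return 0;
}
```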

Benchmarking Ascension Prospects for the Gwangyang Port as a Hub for International Logistics (국제물류허브를 위한 광양항의 벤치마킹 중대방안)

  • Jang, Heung-Hoon; Fawson, Chris
    • Journal of Korea Port Economic Association, v.25 no.1, pp.87-106, 2009
  • This paper suggests benchmarking measures to advance the Gwangyang Port as a hub for international logistics. Most countries seek to join, and lead, the global trading system as they work to develop production and logistics systems that establish a reputation for leadership in international logistics. Our focus in this research is on the Gwangyang Port and whether it is capable of carving out a competitive niche as a hub of international logistics. Our analysis is based on comparison with benchmark port developments around the world. As proposals to promote and activate Gwangyang Port as a hub for international logistics, we recommend several benchmarks. First, the Gwangyang Port FTZ must strengthen the incentive system for tenant companies and provide inducements for new global companies. Second, Gwangyang Port needs to moderate regulation of investment by tenant companies and strengthen one-stop services. Third, labor-management relations must be stabilized and labor-market flexibility secured. Lastly, Gwangyang Port must strengthen the mutual interaction of the Free Economic Zone (FEZ), Customs Free Zone (CFZ), and Free Trade Zone (FTZ) in Korea.


A Hybrid Value Predictor using Static and Dynamic Classification in Superscalar Processors (슈퍼스칼라 프로세서에서 정적 및 동적 분류를 사용한 혼합형 결과 값 예측기)

  • 김주익; 박홍준; 조영일
    • Journal of KIISE: Computer Systems and Theory, v.30 no.10, pp.569-578, 2003
  • Data dependencies are one of the major hurdles limiting ILP (Instruction-Level Parallelism), so several related works have suggested that the limit imposed by data dependencies can be overcome to some extent through data value prediction. A hybrid value predictor can obtain high prediction accuracy by combining the advantages of various predictors, but it has the defect that the same instruction occupies overlapping entries in every component predictor. In this paper, we propose a new hybrid value predictor that achieves high performance by using both static and dynamic classification information. The proposed predictor can enhance the prediction accuracy and efficiently decrease the size of its prediction tables, because it allocates each instruction to the single best-suited predictor during the fetch stage by using static classification information. It can also enhance the prediction accuracy because it selects the best-suited prediction method for instructions with an "unknown" pattern by using the dynamic classification mechanism. Simulation results based on the SimpleScalar/PISA tool set and the SPECint95 benchmarks show an average correct prediction rate of 85.1% when using the static classification mechanism alone, and 87.6% when using both static and dynamic classification.
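
A toy C sketch of two component predictors that such a hybrid scheme chooses between; the state layout and the classification step are illustrative assumptions. A last-value predictor repeats the previous result, while a stride predictor extrapolates a constant difference, which suits loop induction values.

```c
#include <stdio.h>

/* Per-instruction prediction state: the last observed value and the
 * last observed difference (stride) between successive values. */
struct entry { long last; long stride; };

static long predict_last(const struct entry *e)   { return e->last; }
static long predict_stride(const struct entry *e) { return e->last + e->stride; }

static void update(struct entry *e, long actual) {
    e->stride = actual - e->last;   /* learn the new stride */
    e->last   = actual;
}

int main(void) {
    struct entry e = {0, 0};
    int hit_last = 0, hit_stride = 0;
    for (long v = 0; v < 100; v += 4) {   /* induction value: stride 4 */
        if (predict_last(&e)   == v) hit_last++;
        if (predict_stride(&e) == v) hit_stride++;
        update(&e, v);
    }
    /* A classifier would steer this instruction to the stride predictor
     * alone, avoiding a duplicate entry in the last-value table. */
    printf("last-value hits: %d, stride hits: %d\n", hit_last, hit_stride);
    return 0;
}
```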

Performance Improvement of Force-directed Partitioning Algorithm for HW/SW Codesign (하드웨어/소프트웨어 통합설계를 위한 FDS 분할 알고리즘의 성능개선)

  • Oh, Ju-Young; Lee, Myoun-Jae; Lee, Jun-Yong; Park, Do-Soon
    • The KIPS Transactions: Part A, v.9A no.4, pp.491-496, 2002
  • Most partitioning algorithms for hardware-software codesign do not consider scheduling. Therefore, partitioning must be performed again if time constraints are not satisfied when scheduling the partitioned results. Existing FDS-based methods that do consider scheduling during partitioning decide the control step at which to schedule a node while selecting nodes for partitioning. In selecting nodes for partitioning, several aspects must be considered together, such as the cost or time added by partitioning a node and the degree of interference its scheduling causes. Here, the induced force, which represents the degree of interference with the scheduling of other nodes, is computed over every control step of the corresponding node and its dependent nodes. In this paper, a new FDS-based partitioning algorithm is proposed, in which partitioning is performed using the defined scheduling urgency and relative scheduling urgency of the nodes. Since nodes are partitioned by computing relative scheduling urgencies only at the earliest and latest of their assignable control steps, the time complexity of the induced-force computation is improved. Experimental results on the benchmarks show that the proposed algorithm improves execution time compared to existing FDS-based methods.
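
A rough C sketch of the selection idea, with the urgency metric deliberately simplified: each node's mobility is the span between its ASAP (earliest) and ALAP (latest) control steps, and only those two boundary steps are evaluated rather than the whole range. The actual scheduling-urgency definition is the paper's; this mobility-based stand-in only illustrates where the complexity saving comes from.

```c
#include <stdio.h>

/* A node's mobility range in force-directed scheduling (FDS). */
struct node { const char *name; int asap, alap; };

/* Simplified urgency: nodes with little scheduling freedom are more
 * urgent. (A stand-in for the paper's scheduling-urgency metric.) */
static double urgency(const struct node *n) {
    return 1.0 / (double)(n->alap - n->asap + 1);
}

int main(void) {
    struct node nodes[] = {
        { "mul1", 1, 1 },   /* no freedom: most urgent  */
        { "add1", 1, 3 },
        { "add2", 2, 6 },   /* wide range: least urgent */
    };
    int n = sizeof nodes / sizeof nodes[0];
    for (int i = 0; i < n; i++)
        /* Only the two boundary steps (asap, alap) are considered when
         * comparing candidates, not every step in between. */
        printf("%s: steps [%d..%d], urgency %.2f\n",
               nodes[i].name, nodes[i].asap, nodes[i].alap,
               urgency(&nodes[i]));
    return 0;
}
```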