• Title/Summary/Keyword: Memory Bandwidth

Study of Cache Performance on GPGPU

  • Choi, Kyu Hyun;Kim, Seon Wook
    • IEIE Transactions on Smart Processing and Computing / v.4 no.2 / pp.78-82 / 2015
  • General-purpose graphics processing units (GPGPUs) provide tremendous computational and processing power. Despite the latency hiding mechanism, a GPU architecture requires high memory bandwidth and lower latency between computational units and the memory system. For this reason, the current GPU architecture has private L1 caches in each core and a shared L2 cache to increase performance by reducing memory latency. But in some cases, this CPU-like cache design is not suitable for GPGPUs. In this paper, we analyze detailed cache performance related to GPGPU application characteristics, and suggest technical alternatives for the GPGPU architecture as future work.

Dual-Port SDRAM Optimization with Semaphore Authority Management Controller

  • Kim, Jae-Hwan;Chong, Jong-Wha
    • ETRI Journal / v.32 no.1 / pp.84-92 / 2010
  • This paper proposes a semaphore authority management (SAM) controller to optimize dual-port SDRAM (DPSDRAM) in mobile multimedia systems. Recently, DPSDRAM with a shared bank that enables high-speed data exchange between two processors has been developed for dual-processor mobile multimedia systems. However, the latency introduced by the semaphore that prevents access contention at the shared bank slows the data transfer rate and reduces memory bandwidth. SAM increases the data transfer rate by minimizing semaphore latency: it avoids the latency of reading the DPSDRAM semaphore register and reduces the time spent waiting for authority over the shared bank to change. It also reduces the number of authority requests and the number of authority changes. Experimental results using a 1 Gb DPSDRAM (OneDRAM) with SAM controllers at 66 MHz show a 1.6-fold improvement in the data transfer rate between the two processors compared with the traditional controller. In addition, SAM improves bandwidth by up to 38% for port A and 31% for port B compared with the traditional controller.

Design and Performance Evaluation of a Flash Compression Layer for NAND-type Flash Memory Systems (NAND형 플래시메모리를 위한 플래시 압축 계층의 설계 및 성능평가)

  • Yim Keun Soo;Bahn Hyokyung;Koh Kern
    • Journal of KIISE:Computer Systems and Theory / v.32 no.4 / pp.177-185 / 2005
  • NAND-type flash memory is becoming increasingly popular as large-capacity data storage for mobile computing devices. Since flash memory is an order of magnitude more expensive than magnetic disks, data compression can be used effectively in managing flash-memory-based storage systems. However, managing compressed data in NAND-type flash memory is challenging because it supports only page-based I/O. For example, when the size of the compressed data is smaller than the page size, internal fragmentation occurs and seriously degrades the effectiveness of compression. In this paper, we present an efficient flash compression layer (FCL) for NAND-type flash memory that stores several small compressed pages in one physical page by using a write buffer. Based on a prototype implementation and simulation studies, we show that the proposed scheme expands the effective storage capacity of flash memory to more than 140% of its original size and significantly increases the write bandwidth.
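
The packing idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' FCL implementation: the 2048-byte page size, the use of `zlib`, and the names `WriteBuffer` and `PAGE_SIZE` are all assumptions chosen for illustration. Compressed logical pages accumulate in a write buffer, and the buffer is flushed as one physical page when the next compressed page would no longer fit.

```python
import zlib

PAGE_SIZE = 2048  # assumed physical page size in bytes (hypothetical value)


class WriteBuffer:
    """Packs several compressed logical pages into one physical page."""

    def __init__(self, page_size=PAGE_SIZE):
        self.page_size = page_size
        self.pending = []          # (logical_id, compressed bytes) not yet flushed
        self.used = 0              # bytes currently occupied in the buffer
        self.physical_pages = []   # each entry is a list of (logical_id, bytes)

    def write(self, logical_id, data):
        blob = zlib.compress(data)
        if len(blob) > self.page_size:
            raise ValueError("compressed page does not fit in one physical page")
        # Flush first if adding this blob would overflow the buffered page.
        if self.used + len(blob) > self.page_size:
            self.flush()
        self.pending.append((logical_id, blob))
        self.used += len(blob)

    def flush(self):
        """Write the buffered blobs out as a single physical page."""
        if self.pending:
            self.physical_pages.append(self.pending)
            self.pending = []
            self.used = 0

    def read(self, logical_id):
        # Search newest data first so the latest write of a page wins.
        for page in reversed(self.physical_pages + [self.pending]):
            for lid, blob in reversed(page):
                if lid == logical_id:
                    return zlib.decompress(blob)
        raise KeyError(logical_id)
```

A real flash layer would also need a mapping table, garbage collection, and handling for pages that do not compress; those are outside the scope of this sketch.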

Embedded Compression Codec Algorithm for Motion Compensated Wavelet Video Coding System (움직임 보상된 웨이블릿 기반의 비디오 코딩 시스템에 적용 가능한 임베디드 압축 코덱 알고리즘)

  • Kim, Song-Ju
    • The Journal of the Korea Contents Association / v.12 no.3 / pp.77-83 / 2012
  • In this paper, a low-complexity embedded compression (EC) codec algorithm is applied to a wavelet video coder to reduce its excessive external memory requirements. The EC algorithm achieves a fixed compression ratio of 50% under a near-lossless compression constraint. Compared with a direct implementation of the wavelet video encoder, the EC technique reduces by 50% the memory required for intermediate low-frequency coefficients during the multiple discrete wavelet transform (DWT) stages. Furthermore, the EC scheme, which is based on forward adaptive quantization and fixed-length coding, reduces the bandwidth and buffer size between the DWT and SPIHT by 50%. Simulation results show that our EC algorithm incurs an average PSNR degradation of only 0.179 dB and 0.162 dB when the target bit rate of the video coder is 1 bpp and 0.5 bpp, respectively.

Design and Performance Evaluation of MIN for Nonuniform Traffic (비균등 트래픽을 위한 MIN의 설계 및 성능 평가)

  • Choe, Chang-Hun;Kim, Seong-Cheon
    • Journal of the Institute of Electronics Engineers of Korea CI / v.37 no.6 / pp.1-9 / 2000
  • This paper presents a Cluster Oriented Multistage Interconnection Network (MIN) called COMR. COMR suits parallel applications with localized communication because it provides a shortcut path inside each processor-memory cluster that communicates frequently. We evaluate the performance of COMR with respect to probability of acceptance, bandwidth, cost-effectiveness, and average distance under varying degrees of localized communication. The analysis shows that COMR outperforms regular MINs of the same network size under highly localized communication. In the worst case, the diameter of an $N \times N$ COMR is only n+1, just one stage more than a MIN of the same network size. COMR is therefore an attractive interconnection network for parallel applications in shared-memory multiprocessor systems, whether communication is localized or uniformly distributed.

Dynamic Bandwidth Distribution Method for High Performance Non-volatile Memory in Cloud Computing Environment (클라우드 환경에서 고성능 저장장치를 위한 동적 대역폭 분배 기법)

  • Kwon, Piljin;Ahn, Sungyong
    • The Journal of the Institute of Internet, Broadcasting and Communication / v.20 no.3 / pp.97-103 / 2020
  • Linux Cgroups plays a fundamental role in sharing system resources among multiple containers in container-based cloud computing environments. For I/O resources in particular, Linux Cgroups supports a mechanism for sharing I/O bandwidth in proportion to I/O weight. However, the current Linux Cgroups mechanism, which uses the BFQ I/O scheduler, seriously degrades I/O performance on high-bandwidth storage devices such as NVMe SSDs. In this paper, we propose a new feedback-based I/O bandwidth sharing scheme for Linux Cgroups that allocates I/O credits to containers according to their I/O weights and adjusts the amount of credits to the performance fluctuation of NVMe SSDs. The proposed scheme is implemented and evaluated on Linux kernel 5.3. The evaluation results show that it shares I/O bandwidth among multiple containers in proportion to their I/O weights while delivering more than twice the I/O performance of the existing scheme.
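
The weight-proportional credit allocation mentioned in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the kernel implementation from the paper: the function names `allocate_credits` and `update_total`, the smoothing factor `alpha`, and the moving-average feedback rule are all hypothetical stand-ins for the unspecified feedback mechanism.

```python
def allocate_credits(weights, total_credits):
    """Split total_credits among containers in proportion to their I/O weights."""
    weight_sum = sum(weights.values())
    if weight_sum <= 0:
        raise ValueError("weights must sum to a positive value")
    return {name: total_credits * w / weight_sum for name, w in weights.items()}


def update_total(total_credits, measured_throughput, alpha=0.5):
    """Assumed feedback rule: move the credit pool toward the throughput
    the device actually delivered in the last period (exponential average)."""
    return (1.0 - alpha) * total_credits + alpha * measured_throughput


# Example period: two containers with weights 100 and 300 share 1000 credits,
# then the pool is adjusted after the device delivered 2000 units of I/O.
shares = allocate_credits({"a": 100, "b": 300}, 1000)
next_total = update_total(1000, 2000)
```

The split between the two functions mirrors the two steps the abstract names: distribution by weight, then adjustment to device performance.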

Empirical Study of the Long-Term Memory Effect of the KOSPI200 Earning rate volatility (KOSPI200 수익률 변동성의 장기기억과정탐색)

  • Choi, Sang-Kyu
    • Journal of the Korea Academia-Industrial cooperation Society / v.15 no.12 / pp.7018-7024 / 2014
  • This study examined the squared returns and absolute returns of the KOSPI 200 with the GPH (Geweke and Porter-Hudak, 1983) estimator. The GPH estimator obtains the long-term memory parameter d of a time series by linear regression, and it depends on a bandwidth m. The value of m was chosen by tracking the GPH estimate as m varies and identifying the region in which the point estimate is stable. The results suggest that, because 0 < d < 0.5 is satisfied, the squared returns and absolute returns of the KOSPI 200 retain long-term memory.
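
For readers unfamiliar with the estimator, the standard GPH log-periodogram regression can be sketched in a few lines. This is a generic textbook formulation, not the code used in the study; the function name `gph_estimate` is hypothetical. The log periodogram at the first m Fourier frequencies is regressed on $-2\log(2\sin(\lambda_j/2))$, and the slope is the estimate of d.

```python
import cmath
import math


def gph_estimate(x, m):
    """GPH log-periodogram estimate of the memory parameter d,
    using the first m Fourier frequencies (m is the bandwidth)."""
    n = len(x)
    if not 1 < m < n // 2:
        raise ValueError("need 1 < m < n/2")
    mean = sum(x) / n
    centered = [v - mean for v in x]
    regressors, log_periodogram = [], []
    for j in range(1, m + 1):
        lam = 2.0 * math.pi * j / n  # j-th Fourier frequency
        dft = sum(v * cmath.exp(-1j * lam * t) for t, v in enumerate(centered))
        periodogram = abs(dft) ** 2 / (2.0 * math.pi * n)
        regressors.append(-2.0 * math.log(2.0 * math.sin(lam / 2.0)))
        log_periodogram.append(math.log(periodogram))
    # The ordinary least squares slope of log I(lambda_j) on the regressor is d.
    r_mean = sum(regressors) / m
    y_mean = sum(log_periodogram) / m
    sxy = sum((r - r_mean) * (y - y_mean)
              for r, y in zip(regressors, log_periodogram))
    sxx = sum((r - r_mean) ** 2 for r in regressors)
    return sxy / sxx
```

Calling `gph_estimate` over a range of m values and looking for a plateau in the result corresponds to the bandwidth selection procedure the abstract describes.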

Ethernet-Based Avionic Databus and Time-Space Partition Switch Design

  • Li, Jian;Yao, Jianguo;Huang, Dongshan
    • Journal of Communications and Networks / v.17 no.3 / pp.286-295 / 2015
  • Avionic databuses fulfill a critical function in the connection and communication of aircraft components and functions such as flight control, navigation, and monitoring. Ethernet-based avionic databuses have become mainstream for large aircraft owing to their advantages of full-duplex communication with high bandwidth, low latency, low packet loss, and low cost. As a new-generation aviation network communication standard, avionics full-duplex switched Ethernet (AFDX) adopted concepts from the telecom standard asynchronous transfer mode (ATM). In this technology, the switches are the key devices influencing overall performance. This paper reviews the avionic databus with emphasis on the classification of switch architectures. Based on a comparison, analysis, and discussion of the different switch architectures, we propose a new avionic switch design based on a time-division switch fabric for high flexibility and scalability. The design also incorporates the concept of a space-partition switch fabric to achieve reliability and predictability. The new switch architecture, called space partitioned shared memory switch (SPSMS), isolates the memory space for each output port. This can reduce competition for resources and avoid conflicts, decrease the packet forwarding latency through the switch, and reduce the packet loss rate. A simulation of the architecture with optimized network engineering tools (OPNET) confirms its efficiency and a significant performance improvement over a classic shared memory switch in terms of overall packet latency, queuing delay, and queue size.

Design of Optimized SWAP System for Next-Generation Storage Devices (차세대 저장 장치에 최적화된 SWAP 시스템 설계)

  • Han, Hyuck
    • The Journal of the Korea Contents Association / v.15 no.4 / pp.9-16 / 2015
  • On modern operating systems such as Linux, virtual memory is a general way to provide a large address space to applications by using main memory and storage devices. Recently, storage devices have improved in terms of latency and bandwidth, and applications with large memory footprints are expected to achieve high performance when next-generation storage devices are used. However, due to the overhead of the virtual memory subsystem, the paging system cannot exploit the performance of next-generation storage devices. In this study, we propose several optimization techniques for extending memory with next-generation storage devices. The techniques allocate block addresses of storage devices for write-back operations and configure system parameters, and we implement them on Linux 3.14.3. Our evaluation using multiple benchmarks shows that, on average, our system performs 3 times better than the baseline system in the micro-benchmark and 24% better in the macro-benchmark.

Smart Bus Arbiter for QoS control in H.264 decoders

  • Lee, Chan-Ho
    • JSTS:Journal of Semiconductor Technology and Science / v.11 no.1 / pp.33-39 / 2011
  • H.264 decoders usually have a pipeline architecture that operates on a macroblock or a $4 \times 4$ sub-block. The period of the pipeline is usually fixed to guarantee operation in the worst case, which results in many idle cycles and higher data bandwidth. An adaptive pipeline architecture for H.264 decoders has been proposed for efficient decoding and a lower bandwidth requirement on the memory bus. However, it requires a controller for adaptive priority control to exploit this advantage. We propose a smart bus arbiter that replaces the controller. It adjusts the priority adaptively for quality of service (QoS) control of the decoding process. The smart arbiter can be integrated into the arbiter of a bus system, and it operates only when certain conditions are met so that it does not affect the original functions of the arbiter. An H.264 decoder using the proposed architecture is designed and implemented on an FPGA to verify its operation.