• Title/Summary/Keyword: L3 cache

MSHR-Aware Dynamic Warp Scheduler for High Performance GPUs (GPU 성능 향상을 위한 MSHR 활용률 기반 동적 워프 스케줄러)

  • Kim, Gwang Bok;Kim, Jong Myon;Kim, Cheol Hong
    • KIPS Transactions on Computer and Communication Systems / v.8 no.5 / pp.111-118 / 2019
  • Recent graphics processing units (GPUs) provide high throughput by using powerful hardware resources. However, massive memory accesses degrade GPU performance because of cache inefficiency. GPU performance can therefore be improved by reducing thread parallelism when the cache suffers from memory contention. In this paper, we propose a dynamic warp scheduler that controls thread parallelism according to the degree of cache contention. The greedy-then-oldest (GTO) warp issue policy usually exhibits lower parallelism than the loose round-robin (LRR) policy. The proposed warp scheduler therefore employs the LRR policy when Miss Status Holding Register (MSHR) utilization is low and switches to the GTO policy to reduce thread parallelism when MSHR utilization is high. Because it selects the more efficient scheduling policy dynamically, the proposed technique outperforms both LRR and GTO. According to our experimental results, it improves IPC by 12.8% and 3.5% on average over LRR and GTO, respectively.
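
The policy switch described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Warp` fields, the function names, and the 0.5 utilization threshold are all assumptions made for illustration, since the abstract does not state how "low" and "high" MSHR utilization are separated.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Warp:
    warp_id: int
    age: int      # hypothetical field: larger value means an older warp
    ready: bool   # True if the warp can issue this cycle


def select_policy(mshr_in_use: int, mshr_total: int, threshold: float = 0.5) -> str:
    """Pick GTO under high MSHR utilization, LRR otherwise.

    The threshold value is an assumption, not a figure from the paper.
    """
    utilization = mshr_in_use / mshr_total
    return "GTO" if utilization >= threshold else "LRR"


def pick_warp(warps: List[Warp], policy: str, last_issued: int) -> Optional[Warp]:
    """Choose the next warp to issue under the given policy."""
    if policy == "GTO":
        ready = [w for w in warps if w.ready]
        if not ready:
            return None
        # Greedy: keep issuing from the last warp while it is ready.
        for w in ready:
            if w.warp_id == last_issued:
                return w
        # Otherwise fall back to the oldest ready warp.
        return max(ready, key=lambda w: w.age)

    # LRR: scan in warp-id order, starting just after the last issued warp.
    ordered = sorted(warps, key=lambda w: w.warp_id)
    ids = [w.warp_id for w in ordered]
    start = ids.index(last_issued) + 1 if last_issued in ids else 0
    for k in range(len(ordered)):
        w = ordered[(start + k) % len(ordered)]
        if w.ready:
            return w
    return None


warps = [Warp(0, 3, True), Warp(1, 2, True), Warp(2, 1, False), Warp(3, 0, True)]
policy = select_policy(mshr_in_use=30, mshr_total=32)
chosen = pick_warp(warps, policy, last_issued=2)
```

With 30 of 32 MSHRs occupied, the sketch selects GTO, which concentrates issue on fewer warps; with few MSHRs occupied it would rotate across all ready warps instead.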

Proxy Cache Replacement Policy reflecting Network Transmission Costs in Web and Multimedia Environments (웹과 멀티미디어 요청이 혼재한 환경에서 네트워크 전송 비용을 고려한 프락시 캐시 교체 정책)

  • 서진모;강지숙;남동훈;박승규
    • Proceedings of the Korean Information Science Society Conference / 2002.04a / pp.10-12 / 2002
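  • User demand and the growth of Internet applications have caused a rapid increase in the number of large media objects, so network transmission cost is a factor that must be taken into account. In this paper, we analyze existing proxy cache replacement policies and propose two improved policies, G-N and L-N. These extend the GDSF and LFU-DA policies adopted by the proxy cache software 'Squid' by adding network transmission cost. Simulation results show that, compared with the existing algorithms, the proposed policies reduce average response time by more than 10%, and that the additional cost (processing overhead) does not increase significantly.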
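
As a rough illustration of the idea, the sketch below uses the commonly cited GDSF priority, inflation + frequency × cost / size, and feeds a per-object network cost into the cost term. The abstract does not give the actual G-N or L-N formulas, so the class name, the `network_cost` parameter, and the way the cost enters the priority are assumptions.

```python
class GDSFCache:
    """Toy GDSF cache whose cost term is a caller-supplied network cost."""

    def __init__(self, capacity_bytes: int):
        self.capacity = capacity_bytes
        self.used = 0
        self.inflation = 0.0   # aging value L: priority of the last evicted object
        self.entries = {}      # key -> {"size", "freq", "cost", "priority"}

    def _priority(self, freq: int, cost: float, size: int) -> float:
        # GDSF priority; `cost` is a hypothetical network transmission estimate.
        return self.inflation + freq * cost / size

    def access(self, key: str, size: int, network_cost: float) -> bool:
        """Return True on a hit and False on a miss (inserting the object)."""
        entry = self.entries.get(key)
        if entry is not None:
            entry["freq"] += 1
            entry["priority"] = self._priority(entry["freq"], entry["cost"], entry["size"])
            return True
        if size > self.capacity:
            return False  # object is too large to cache at all
        # Evict the lowest-priority objects until the new one fits.
        while self.used + size > self.capacity:
            victim = min(self.entries, key=lambda k: self.entries[k]["priority"])
            self.inflation = self.entries[victim]["priority"]
            self.used -= self.entries[victim]["size"]
            del self.entries[victim]
        self.entries[key] = {
            "size": size,
            "freq": 1,
            "cost": network_cost,
            "priority": self._priority(1, network_cost, size),
        }
        self.used += size
        return False


cache = GDSFCache(capacity_bytes=100)
cache.access("a", size=50, network_cost=1.0)    # cheap to refetch
cache.access("b", size=50, network_cost=10.0)   # expensive to refetch
cache.access("c", size=50, network_cost=1.0)    # forces one eviction
```

In this run the object that is cheap to refetch ("a") is evicted before the expensive one ("b"), which is the behavior a network-cost-aware policy aims for.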

A 3-D Vision Sensor Implementation on Multiple DSPs TMS320C31 (다중 TMS320C31 DSP를 사용한 3-D 비젼센서 Implementation)

  • Oksenhendler, V.;Bensrhair, Abdelaziz;Miche, Pierre;Lee, Sang-Goog
    • Journal of Sensor Science and Technology / v.7 no.2 / pp.124-130 / 1998
  • High-speed 3D vision systems are essential for autonomous robot and vehicle control applications. In our study, a stereo vision process has been developed. It consists of three steps: extraction of edges in the right and left images, matching of corresponding edges, and calculation of the 3D map. This process is implemented on a VME 150/40 Imaging Technology vision system, a modular system composed of a display module, an acquisition module, a 4-Mbyte image frame memory, and three computational cards. The programmable accelerator computational modules run at 40 MHz and are based on the TMS320C31 DSP with a $64{\times}32$-bit instruction cache and two $1024{\times}32$-bit internal RAMs. Each is equipped with 512 Kbytes of static RAM, 4 Mbytes of image memory, 1 Mbyte of flash EEPROM, and a serial port. Data transfers and communications between modules are provided by three 8-bit global video buses and three local, configurable 8-bit pipeline video buses. The VME bus is dedicated to system management. Tasks are distributed among the DSPs as follows: two DSPs are used for edge detection, one for the right image and the other for the left. The last processor performs the matching and the 3D calculation. With $512{\times}512$-pixel images, this sensor generates dense 3D maps at a rate of about 1 Hz, depending on the scene complexity. Results could certainly be improved by using specially suited multiprocessor cards.

Performance Enhancement of Handover in mSCTP using Pre-acquisition RA in WLAN (WLAN에서 RA 선수신을 이용한 mSCTP 핸드오버 성능 향상)

  • Choi, Soon-Won;Kim, Kwang-Ryoul;Min, Sung-Gi
    • Journal of KIISE:Information Networking / v.33 no.2 / pp.156-164 / 2006
  • An SCTP (Stream Control Transmission Protocol) implementation with the DAR (Dynamic Address Reconfiguration) extension is called mSCTP (Mobile SCTP), which was recently proposed for mobility support in the transport layer. mSCTP does not achieve the short handover latency required by real-time applications, and it has no specific handover decision mechanism. In this paper, we propose fast handover schemes for mobile nodes moving into a different subnet, using pre-acquisition of the RA (Router Advertisement) and an L3 trigger to improve handover performance. Furthermore, we introduce three specific methods, namely RA cache, FMIPv6 (Fast Handovers for Mobile IPv6), and dual interface, and describe how the proposed scheme interoperates with the handover process in each case. Finally, we present experimental results on Linux platforms for two configurations: mSCTP and mSCTP using FMIPv6. The results show that handover performance improves when the time to receive the RA, which accounts for most of the total handover latency, is reduced.

Accelerating Medical Image Processing on Integrated GPU Using OpenCL (OpenCL을 이용한 내장형 GPU에서의 의학영상처리 가속화)

  • Kim, Beom-Jun;Shin, Byeong-seok
    • Journal of the Korea Computer Graphics Society / v.23 no.2 / pp.1-10 / 2017
  • A variety of filters are applied to improve the quality of noisy, low-resolution medical images. This is necessary to reduce the patient's radiation dose and to improve the utilization of conventional spherical imaging equipment. Conventionally, filtering is performed on the CPU of a PC. However, it is difficult to produce results in real time when applying various calculations and filters to high-resolution human images using only the CPU of a PC used in a hospital. In this paper, we analyze the structure and performance of the Intel GPU integrated in the CPU and propose a method for performing image filtering using OpenCL parallel processing. By applying filters with high computational complexity to medical images, high-quality images can be generated in real time.

Binary Search on Multiple Small Trees for IP Address Lookup

  • Lee BoMi;Kim Won-Jung;Lim Hyesook
    • Proceedings of the IEEK Conference / 2004.06a / pp.175-178 / 2004
  • This paper describes a new IP address lookup algorithm that uses binary search on multiple balanced trees stored in a single memory. The proposed scheme has three different tables: a range table, a main table, and multiple sub-tables. The range table includes $2^8$ entries, each 22 bits wide. Each main table and sub-table entry is composed of fields for a prefix, a prefix length, the number of sub-table entries, a sub-table pointer, and a forwarding RAM pointer. Binary searches are performed in the main table and the multiple sub-tables in sequence. An address lookup in the proposed scheme takes 11 memory accesses on average, 1 at minimum, and 24 at maximum, using 267 Kbytes of memory for 38,000 prefixes. Hence the forwarding table of the proposed scheme can be stored in an L2 cache, and the address lookup algorithm is implemented in software running on a general-purpose processor. Since the proposed scheme depends only on the number of prefixes, not on their length, it scales easily to IPv6.
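
The sketch below shows the general pattern of indexing a $2^8$-entry range table with the first 8 address bits and then running a binary search inside the selected slice. It is a simplified illustration, not the paper's table layout: it assumes disjoint (non-nested) IPv4 prefixes of length 8 or more, omits the sub-tables entirely, and uses hypothetical names (`build_tables`, `lookup`).

```python
import bisect
import ipaddress


def build_tables(prefixes):
    """Build a 256-entry range table over a sorted main table.

    `prefixes` is a list of (cidr_string, next_hop) pairs, assumed here to be
    disjoint and at least 8 bits long so that each lies in one bucket.
    """
    main = sorted(
        (int(net.network_address), net.prefixlen, hop)
        for net, hop in ((ipaddress.ip_network(c), h) for c, h in prefixes)
    )
    starts = [entry[0] for entry in main]
    # range_table[b] = (lo, hi): slice of `main` whose first byte equals b.
    range_table = []
    for b in range(256):
        lo = bisect.bisect_left(starts, b << 24)
        hi = bisect.bisect_left(starts, (b + 1) << 24)
        range_table.append((lo, hi))
    return range_table, starts, main


def lookup(tables, address):
    """Return the next hop for `address`, or None if no prefix covers it."""
    range_table, starts, main = tables
    addr = int(ipaddress.ip_address(address))
    lo, hi = range_table[addr >> 24]          # one access by the top 8 bits
    i = bisect.bisect_right(starts, addr, lo, hi) - 1   # binary search in slice
    if i < lo:
        return None
    start, length, hop = main[i]
    mask = (0xFFFFFFFF << (32 - length)) & 0xFFFFFFFF
    return hop if (addr & mask) == start else None


tables = build_tables([
    ("10.0.0.0/8", "A"),
    ("192.168.1.0/24", "B"),
    ("192.168.2.0/24", "C"),
])
hop = lookup(tables, "192.168.2.77")
```

Because the search cost depends on how many prefixes fall in a bucket rather than on prefix length, the same pattern carries over to longer addresses, which matches the scaling argument in the abstract.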

Weighted Binary Prefix Tree for IP Address Lookup (IP 주소 검색을 위한 가중 이진 프리픽스 트리)

  • Yim Changhoon;Lim Hyesook;Lee Bomi
    • The Journal of Korean Institute of Communications and Information Sciences / v.29 no.11B / pp.911-919 / 2004
  • IP address lookup is one of the essential functions of Internet routers, and it determines overall router performance. The most important evaluation factor for software-based IP address lookup is the worst-case number of memory accesses. The binary prefix tree (BPT) scheme gives a small worst-case number of memory accesses among previous software-based schemes. However, the tree structure of the BPT is normally unbalanced. In this paper, we propose the weighted binary prefix tree (WBPT) scheme, which generates a nearly balanced tree by incorporating the concept of weight into the BPT generation process. The proposed WBPT gives a very small worst-case number of memory accesses compared with previous software-based schemes. Moreover, the WBPT requires a comparably small amount of memory, which fits within an L2 cache for about 30,000 prefixes, and prefix addition and deletion are rather simple. Hence the proposed WBPT can be used for software-based IP address lookup in practical routers.

An Investigation of the Performance of the Colored Gauss-Seidel Solver on CPU and GPU (Coloring이 적용된 Gauss-Seidel 해법을 통한 CPU와 GPU의 연산 효율에 관한 연구)

  • Yoon, Jong Seon;Jeon, Byoung Jin;Choi, Hyoung Gwon
    • Transactions of the Korean Society of Mechanical Engineers B / v.41 no.2 / pp.117-124 / 2017
  • The performance of the colored Gauss-Seidel solver on CPU and GPU was investigated for two- and three-dimensional heat conduction problems using different mesh sizes. The heat conduction equation was discretized by the finite difference method and the finite element method. The CPU performed well for small problems, but its performance deteriorated for large problems in which the total memory required for the computation exceeded the cache size. In contrast, the GPU performed better as the mesh size increased because of its latency-hiding technique. Further, GPU computation with the colored Gauss-Seidel solver was approximately 7 times as fast as computation on a single CPU. Furthermore, the colored Gauss-Seidel solver was found to be approximately twice as fast as the Jacobi solver when parallel computing was conducted on the GPU.
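
A minimal serial sketch of a two-color (red-black) Gauss-Seidel sweep for steady 2D heat conduction follows. It assumes a uniform grid, a 5-point finite-difference stencil, and fixed boundary temperatures; the function name and the test problem are illustrative and not taken from the paper, which also covers finite element discretization and GPU execution.

```python
def red_black_gauss_seidel(grid, sweeps):
    """Relax interior points of `grid` in place using two-color ordering.

    Points with (i + j) even are updated first, then points with (i + j) odd.
    Under the 5-point stencil every neighbor of a point has the other color,
    so all updates within one color are independent of each other.
    """
    rows, cols = len(grid), len(grid[0])
    for _ in range(sweeps):
        for color in (0, 1):
            for i in range(1, rows - 1):
                for j in range(1, cols - 1):
                    if (i + j) % 2 == color:
                        grid[i][j] = 0.25 * (
                            grid[i - 1][j] + grid[i + 1][j]
                            + grid[i][j - 1] + grid[i][j + 1]
                        )
    return grid


# 5x5 plate: top edge held at 1.0, the other edges at 0.0.
n = 5
plate = [[0.0] * n for _ in range(n)]
plate[0] = [1.0] * n
red_black_gauss_seidel(plate, sweeps=200)
center = plate[n // 2][n // 2]
```

The independence of same-color updates is what makes the method suitable for parallel hardware: each color can be processed as one parallel pass, while still using the freshest values from the other color.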