Search | Korea Science

New Two-Level L1 Data Cache Bypassing Technique for High Performance GPUs

Kim, Gwang Bok;Kim, Cheol Hong
- Journal of Information Processing Systems / v.17 no.1 / pp.51-62 / 2021
On-chip caches of graphics processing units (GPUs) have contributed to improved GPU performance by reducing long memory access latency. However, cache efficiency remains low despite the fact that recent GPUs have considerably mitigated the bottleneck problem of the L1 data cache. Although the cache miss rate is a reasonable metric for cache efficiency, it is not necessarily proportional to GPU performance. In this study, we introduce a second key determinant, because predicting the performance gains from the L1 data cache based on miss rate alone is not accurate. The proposed technique estimates the benefits of the cache by measuring the balance between cache efficiency and throughput. The throughput of the cache is predicted based on the warp occupancy information in the warp pool. The warp occupancy is then used for a second bypass phase when workloads show an ambiguous miss rate. In our proposed architecture, the L1 data cache is turned off for a long period when the warp occupancy is not high. Our two-level bypassing technique can be applied to recent GPU models and improves performance by 6% on average compared to the architecture without bypassing. Moreover, it outperforms conventional bottleneck-based bypassing techniques.
https://doi.org/10.3745/JIPS.01.0062
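
The two-level decision described in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the authors' hardware logic: the threshold values, the counter names, and the rule that a clearly high miss rate triggers bypassing at the first level are all assumptions, since the abstract gives no numbers. Only the second level (bypass when the miss rate is ambiguous and warp occupancy is not high) comes directly from the text.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the abstract does not give numeric values.
LOW_MISS_RATE = 0.3
HIGH_MISS_RATE = 0.7
HIGH_OCCUPANCY = 0.75


@dataclass
class SampleWindow:
    """Counters collected over one monitoring window (illustrative names)."""
    l1_accesses: int
    l1_misses: int
    active_warps: int
    max_warps: int


def should_bypass_l1(w: SampleWindow) -> bool:
    """Return True if the L1 data cache should be bypassed for the next period."""
    miss_rate = w.l1_misses / w.l1_accesses if w.l1_accesses else 0.0
    # Level 1 (assumed rule): a clearly low or clearly high miss rate decides alone.
    if miss_rate <= LOW_MISS_RATE:
        return False
    if miss_rate >= HIGH_MISS_RATE:
        return True
    # Level 2 (from the abstract): ambiguous miss rate, so consult warp occupancy
    # and turn the cache off when occupancy is not high.
    occupancy = w.active_warps / w.max_warps if w.max_warps else 0.0
    return occupancy < HIGH_OCCUPANCY
```

For example, a window with a 50% miss rate and only 12 of 48 warps active falls through to the second level and is bypassed, while the same miss rate with a full warp pool keeps the cache on.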

FPGA-based One-Chip Architecture and Design of Real-time Video CODEC with Embedded Blind Watermarking (블라인드 워터마킹을 내장한 실시간 비디오 코덱의 FPGA기반 단일 칩 구조 및 설계)

서영호;김대경;유지상;김동욱
- The Journal of Korean Institute of Communications and Information Sciences / v.29 no.8C / pp.1113-1124 / 2004
In this paper, we propose a hardware (H/W) structure that can compress and reconstruct the input image in real time, and we implement it on an FPGA platform using VHDL (VHSIC Hardware Description Language). All the image processing elements needed for both compression and reconstruction in an FPGA were considered, and each of them was mapped into H/W with a structure efficient for the FPGA. We used the DWT (discrete wavelet transform), which transforms the data from the spatial domain to the frequency domain, because we considered Motion JPEG2000 as the application. The implemented H/W is separated into a data path part and a control part. The data path part consists of the image processing blocks and the data processing blocks. The image processing blocks consist of the DWT Kernel for the filtering by DWT, the Quantizer/Huffman Encoder, the Inverse Adder/Buffer for adding the low frequency coefficient to the high frequency one in the inverse DWT operation, and the Huffman Decoder. There are also interface blocks for communicating with the external application environment and timing blocks for buffering between the internal blocks. The global operations of the designed H/W are image compression and reconstruction, and it operates in units of a field, synchronized with the A/D converter. The implemented H/W used 69% (16980) of the LAB (Logic Array Block) and 9% (28352) of the ESB (Embedded System Block) resources in the APEX20KC EP20K600CB652-7 FPGA chip of ALTERA, and operated stably at a 70 MHz clock frequency. We thus verified real-time operation at 60 fields/sec (30 frames/sec).

Hardware Design of Enhanced Real-Time Sound Direction Estimation System (향상된 실시간 음원방향 인지 시스템의 하드웨어 설계)

Kim, Tae-Wan;Kim, Dong-Hoon;Chung, Yun-Mo
- The Journal of the Acoustical Society of Korea / v.30 no.3 / pp.115-122 / 2011
In this paper, we present a method to accurately estimate the direction of a sound source in real time, based on time delay of arrival and using generalized cross correlation with four cross-type microphones. In general, existing systems have two disadvantages: limited embeddability, due to the need for data acquisition from the microphone inputs for signal processing, and difficulty of real-time processing, because of the increased number of channels for sound direction estimation using DSP processors. To cope with these disadvantages, this paper proposes a hardware design for enhanced real-time processing using microphone array signal processing. Accurate direction estimation and a reduction in design time are achieved by means of an efficient hardware design using spatial segmentation methods and verification techniques. Finally, we develop a system that can be used in embedded systems, using a sound codec and an FPGA chip. According to experimental results, the system gives much faster real-time processing than either PC-based systems or systems with DSP processors.
https://doi.org/10.7776/ASK.2011.30.3.115
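
As a rough illustration of delay-based direction estimation for a single microphone pair, the sketch below finds the lag that maximizes a plain (unweighted) cross-correlation and converts it to a far-field arrival angle. It is a software sketch under stated assumptions, not the paper's hardware design: the function names, the speed-of-sound constant, and the far-field model are assumptions, and the unweighted correlation is only the simplest member of the generalized cross correlation family.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def estimate_delay(x, y, max_lag):
    """Return the lag d (in samples) that maximizes sum_n x[n] * y[n + d].

    A positive result means signal y is delayed relative to signal x.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, xn in enumerate(x):
            m = n + lag
            if 0 <= m < len(y):
                score += xn * y[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def arrival_angle(lag, sample_rate, mic_spacing):
    """Convert a lag between two microphones into a far-field angle in degrees."""
    tau = lag / sample_rate  # delay in seconds
    # Clamp to the valid asin domain in case of noisy or oversized lags.
    s = max(-1.0, min(1.0, SPEED_OF_SOUND * tau / mic_spacing))
    return math.degrees(math.asin(s))
```

With four microphones in a cross arrangement, one would presumably run this on each of the two orthogonal pairs and combine the two delays; that combination step is not shown here.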

A VLSI Architecture for the Real-Time 2-D Digital Signal Processing (실시간 2차원 디지털 신호처리를 위한 VLSI 구조)

권희훈
- Information and Communications Magazine / v.9 no.9 / pp.72-85 / 1992
The throughput requirement for many digital signal processing applications is such that multiple processing units are essential for real-time implementation. Advances in VLSI technology make it feasible to design and implement computer systems consisting of a large number of function units. Research on a very high throughput VLSI architecture for digital signal processing applications requires the development of an algorithm decomposition scheme that can minimize data communication requirements as well as computational complexity. The objectives of the research are to investigate computationally efficient algorithms for the solution of the class of problems that can be modeled as DLSI systems or adaptive systems, and to develop VLSI architectures and associated multiprocessor systems that can be used to implement these algorithms in real time. A new VLSI architecture for real-time 2-D digital signal processing applications is proposed in this research. This VLSI architecture extends the concept of having a single processing unit in a chip. Because the complexity and the number of computations per input do not increase as the size of the input data is increased, this architecture can process very large 2-D data in near real time.

An Adaptive Decision-Feedback Equalizer Architecture using RB Complex-Number Filter and chip-set design (RB 복소수 필터를 이용한 적응 결정귀환 등화기 구조 및 칩셋 설계)

Kim, Ho Ha;An, Byeong Gyu;Sin, Gyeong Uk
- The Journal of Korean Institute of Communications and Information Sciences / v.24 no.12A / pp.2015-2024 / 1999
Presented in this paper are a new complex-number filter architecture, which is suitable for an efficient implementation of baseband signal processing in digital communication systems, and a chip-set design of an adaptive decision-feedback equalizer (ADFE) employing the proposed structure. The basic concept behind the proposed approach is to apply redundant binary (RB) arithmetic instead of conventional 2's complement arithmetic in order to achieve an efficient realization of complex-number multiplication and accumulation. With the proposed approach, an N-tap complex-number filter can be realized using 2N RB multipliers and 2N-2 RB adders, and each filter tap has a critical delay of $T_{m.RB}+T_{a.RB}$ (where $T_{m.RB}$ and $T_{a.RB}$ are the delays of an RB multiplier and an RB adder, respectively), making the filter structure simple and enhancing speed by means of reduced arithmetic operations. To demonstrate the proposed idea, a prototype ADFE chip-set has been designed, consisting of an FFEM (Feed-Forward Equalizer Module) and a DFEM (Decision-Feedback Equalizer Module) that can be cascaded to implement longer filter taps. Each module is composed of two complex-number filter taps with their LMS coefficient update circuits, and contains about 26,000 gates. The chip-set was modeled and verified using COSSAP and VHDL, and synthesized using a 0.8-μm SOG (Sea-Of-Gate) cell library.

Run-time Memory Optimization Algorithm for the DDMB Architecture (DDMB 구조에서의 런타임 메모리 최적화 알고리즘)

Cho, Jeong-Hun;Paek, Yun-Heung;Kwon, Soo-Hyun
- The KIPS Transactions:PartA / v.13A no.5 s.102 / pp.413-420 / 2006
Most vendors of digital signal processors (DSPs) support a Harvard architecture, which has two or more memory buses, one for program and one or more for data, and allows the processor to access multiple words of data from memory in a single instruction cycle. We already addressed how to efficiently assign data to multiple memory banks in our previous work. This paper reports on our recent attempt to optimize run-time memory. The run-time environment for dual data memory banks (DDMBs) requires two run-time stacks to control the activation records located in the two memory banks for each called procedure. However, the activation records of a procedure in the two memory banks can differ in size. As a consequence, the dual run-time stacks can become unbalanced whenever a procedure is called. This imbalance between the two memory banks means that usage of one memory bank can exceed the on-chip memory area even though there is free space in the other memory bank. In this paper, we attempt to balance the dual run-time stacks to make efficient use of on-chip memory. The experimental results show that although our algorithm is relatively simple, it still utilizes run-time memory efficiently, thus enabling our compiler to run extremely fast while minimizing the usage of run-time memory in the target code.
https://doi.org/10.3745/KIPSTA.2006.13A.5.413
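
To make the imbalance problem in the abstract above concrete, here is a small sketch of one possible balancing heuristic. It is not the authors' algorithm, which the abstract does not describe: the greedy rule (place the larger half of each activation record on the currently shallower stack) and the function name `balance_call_chain` are assumptions chosen only to illustrate the idea.

```python
def balance_call_chain(frames):
    """Greedy balancing of dual run-time stacks (hypothetical heuristic).

    frames: list of (size_a, size_b) pairs, the two halves of each nested
            procedure's activation record, nominally destined for bank X and bank Y.
    Returns (swapped, depth_x, depth_y), where swapped[i] is True if frame i
    had its halves exchanged between the two banks.
    """
    depth_x = depth_y = 0
    swapped = []
    for size_a, size_b in frames:
        big, small = max(size_a, size_b), min(size_a, size_b)
        # Put the larger half on whichever stack is currently shallower.
        if depth_x <= depth_y:
            add_x, add_y = big, small
        else:
            add_x, add_y = small, big
        swapped.append((add_x, add_y) != (size_a, size_b))
        depth_x += add_x
        depth_y += add_y
    return swapped, depth_x, depth_y
```

For four nested calls that each need (8, 2) words, leaving the halves in place would grow the stacks to 32 and 8 words, whereas this heuristic ends with 20 words on each stack.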

Miniaturized Multilayer Band Pass Chip filter for IMT-2000 (IMT-2000용 초소헝 적층형 대역 통과 칩 필터 설계 및 제작)

Lim, Hyuk;Ha, Jong-Yoon;Sim, Sung-Hun;Kang, Chong-Yun;Choi, Ji-Won;Choi, Se-Young;Oh, Young-Jei;Kim, Hyun-Jai;Yoon, Seok-Jin
- Journal of the Korean Ceramic Society / v.40 no.10 / pp.961-966 / 2003
A Multi-Layer Ceramic (MLC) chip type Band-Pass Filter (BPF) using BiNb$_{0.975}$Sb$_{0.025}$O$_4$ LTCC (Low Temperature Co-fired Ceramics) and MLC processing is presented. The MLC chip BPF has the benefits of low cost and small size. The BPF consists of coupled stripline resonators and coupling capacitors. The BPF is designed to have an attenuation pole below the passband for the receiver band of an IMT-2000 handset. Computer-aided design technology is applied to analyze the BPF frequency characteristics. The attenuation pole depends on the coupling between resonators and the coupling capacitance. An equivalent circuit and a structure for the MLC chip BPF are proposed. The frequency characteristics of the manufactured BPF are acceptable for IMT-2000 applications.
https://doi.org/10.4191/KCERS.2003.40.10.961

A Design of Pipelined Adaptive Decision-Feedback Equalizer using Delayed LMS and Redundant Binary Complex Filter Structure (Delayed LMS와 Redundant Binary 복소수 필터구조를 이용한 파이프라인 적응 결정귀환 등화기 설계)

An, Byung-Gyu;Lee, Jong-Nam;Shin, Kyung-Wook
- Journal of the Institute of Electronics Engineers of Korea SD / v.37 no.12 / pp.60-69 / 2000
This paper describes a single-chip full-custom implementation of a pipelined adaptive decision-feedback equalizer (PADFE) using a 0.25-$\mu$m CMOS technology for wide-band wireless digital communication systems. To enhance the throughput rate of the ADFE, two pipeline stages are inserted into its critical path by using the delayed least-mean-square (DLMS) algorithm. Redundant binary (RB) arithmetic is applied to all the data processing of the PADFE, including the filter taps and coefficient update blocks. Compared with conventional methods based on two's complement arithmetic, the proposed approach reduces arithmetic complexity and results in a very simple complex-valued filter structure, and is thus suitable for VLSI implementation. The design parameters, including pipeline stages, filter taps, and coefficient and internal bit-widths, and the equalization performance, such as bit error rate (BER) and convergence speed, are analyzed by algorithm-level simulation using COSSAP. The single-chip PADFE contains about 205,000 transistors on an area of about $1.96 \times 1.35\ \mathrm{mm}^2$. Simulation results show that it can safely operate at a 200-MHz clock frequency with a 2.5-V supply, and its estimated power dissipation is about 890 mW. Test results show that the fabricated chip is functionally correct.

A Multicellular Spheroid Formation and Extraction Chip Using Removable Cell Trapping Barriers (한시적 세포포집 구조물을 이용한 다세포 스페로이드 형성 및 추출칩)

Jin, Hye-Jin;Kim, Tae-Yoon;Cho, Young-Ho;Gu, Jin-Mo;Kim, Jhin-Gook;Oh, Yong-Soo
- Transactions of the Korean Society of Mechanical Engineers A / v.35 no.2 / pp.131-134 / 2011
We propose a spheroid chip that uses removable cell trapping barriers and that is capable of forming and extracting multicellular spheroids. By using a conventional well plate and flask, it is difficult to form small-sized spheroids, which resemble avascular 3D cell-cell interaction. It was difficult to extract spheroids using conventional microchips and fixed cell trapping barriers. The proposed chip, however, facilitates both formation and extraction of spheroids by using removable cell trapping barriers formed by membrane deflection. The cell trapping barriers, formed at the membrane pressure of 50 kPa, hold the cells in the trapping region at a cell inlet pressure of 145.155 Pa. After incubation for 24 h, the trapped cells form uniform spheroids. We successfully extract the spheroids at a cell inlet pressure of 5 kPa after removing the membrane pressure. The extracted spheroids have a diameter of $197.2 \pm 11.7\ \mu\mathrm{m}$ with a viability of $80.3 \pm 7.7\%$. Using the proposed chip, uniform spheroids can be formed and these spheroids can be safely extracted for carrying out the post-processing of spheroids.
https://doi.org/10.3795/KSME-A.2011.35.2.131

Design of an Image Processing ASIC Architecture using Parallel Approach with Zero or Little (통신부담을 감소시킨 영상처리를 위한 병렬처리 방식 ASIC구조 설계)

안병덕;정지원;선우명훈
- The Journal of Korean Institute of Communications and Information Sciences / v.19 no.10 / pp.2043-2052 / 1994
This paper proposes a new parallel ASIC architecture for real-time image processing, called the Sliding Memory Plane (SliM) Image Processor, that reduces inter-processing element (inter-PE) communication overhead. The SliM Image Processor consists of $3\times3$ processing elements (PEs) connected in a mesh topology. Because this topology scales easily, a set of SliM Image Processors can form a mesh-connected SIMD parallel architecture, called the SliM Array Processor. The idea of sliding means that all pixels are slid into all neighboring PEs without interrupting the PEs and without a coprocessor or a DMA controller. Since inter-PE communication and computation occur simultaneously, the inter-PE communication overhead, a significant disadvantage of existing machines, greatly diminishes. Two I/O planes provide a buffering capability and reduce the data I/O overhead. In addition, a by-passing path provides eight-way connectivity even with four links. With these salient features, SliM shows a significant performance improvement. This paper presents the architectures of a PE and the SliM Image Processor, and describes the design of an instruction set.

