• Title/Abstract/Keyword: Pipelined architecture


Scalable Broadcast Switch Architecture (가변형 방송 스위치 구조)

  • 정갑중;이범철
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference
    • /
    • 2004.05b
    • /
    • pp.291-294
    • /
    • 2004
  • In this paper, we consider a broadcast switch architecture for high-performance multicast packet switching. For an input- and output-buffered switch, we propose a new architecture that supports high throughput in broadcast packet switching using switch planes of single-input, multiple-output crossbars. The proposed switch has a central arbiter that arbitrates requests from multiple input ports and generates multiple grant signals to multiple output ports within a packet transmission slot, providing high-speed pipelined arbitration and large-scale switching capacity (an illustrative arbiter sketch follows below).

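As a concrete picture of the central arbitration described above, the following minimal Python sketch models one transmission slot of a multicast arbiter. It assumes a simple per-output round-robin policy; the paper's actual pipelined arbiter circuit is not specified in the abstract, so this is only an illustration of the request/grant idea.

def arbitrate(requests, pointers):
    """requests: dict {input_port: set of requested output ports} (multicast fanout).
    pointers: dict {output_port: next input to consider} for round-robin fairness.
    Returns {output_port: granted input_port} for one transmission slot."""
    grants = {}
    n_inputs = max(requests) + 1 if requests else 0
    # Collect which inputs want each output.
    wants = {}
    for inp, outs in requests.items():
        for out in outs:
            wants.setdefault(out, []).append(inp)
    # Grant each output independently, starting the search at its round-robin pointer.
    for out, inps in wants.items():
        start = pointers.get(out, 0)
        for k in range(n_inputs):
            cand = (start + k) % n_inputs
            if cand in inps:
                grants[out] = cand
                pointers[out] = (cand + 1) % n_inputs
                break
    return grants

# Example: input 0 multicasts to outputs {0, 1}, input 1 requests output 1 only.
ptrs = {}
print(arbitrate({0: {0, 1}, 1: {1}}, ptrs))   # first slot: both grants go to input 0
print(arbitrate({0: {0, 1}, 1: {1}}, ptrs))   # output 1 now rotates to input 1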

Energy Efficient and Low-Cost Server Architecture for Hadoop Storage Appliance

  • Choi, Do Young;Oh, Jung Hwan;Kim, Ji Kwang;Lee, Seung Eun
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.14 no.12
    • /
    • pp.4648-4663
    • /
    • 2020
  • This paper proposes a Lempel-Ziv 4 (LZ4) compression accelerator optimized for scale-out servers in data centers. In order to reduce the CPU load caused by compression, we propose an accelerator solution and implement it on a Field Programmable Gate Array (FPGA) as heterogeneous computing. The LZ4 compression hardware accelerator is a fully pipelined architecture and applies 16 dictionaries to enhance parallelism for a high-throughput compressor. Our hardware accelerator is based on a 20-stage pipeline and dictionary architecture, highly customized to the LZ4 compression algorithm and to parallel hardware implementation. The proposed dictionary architecture achieves high throughput by comparing input sequences against multiple dictionaries simultaneously rather than a single dictionary (a software sketch of this idea follows below). The experimental results show high throughput from the intensively optimized FPGA implementation. Additionally, we compare our implementation against a CPU implementation of LZ4 to provide insights for FPGA-based data centers. The proposed accelerator achieves a compression throughput of 639 MB/s with fine-grained parallelism suitable for deployment in scale-out servers. This approach enables a low-power Intel Atom processor to realize Hadoop storage along with the compression accelerator.
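To make the multi-dictionary idea concrete, here is a minimal, sequential Python sketch of LZ4-style match searching across several hash-indexed dictionary banks. The 16-bank split and the hash function are assumptions for illustration; the accelerator performs the per-bank probes in parallel hardware rather than in a loop.

NUM_BANKS = 16

def hash4(word: bytes) -> int:
    # Simple multiplicative hash of a 4-byte sequence (not the accelerator's actual hash).
    return (int.from_bytes(word, "little") * 2654435761) & 0xFFFFFFFF

def find_match(data: bytes, pos: int, banks: list):
    """Probe every bank for the 4-byte sequence at pos and return the longest match as
    (offset, length), or (0, 0). Hardware would probe all banks in the same cycle."""
    if pos + 4 > len(data):
        return (0, 0)
    cur = data[pos:pos + 4]
    key = hash4(cur)
    best = (0, 0)
    for bank in banks:                          # sequential here, parallel in hardware
        cand = bank.get(key)
        if cand is None or data[cand:cand + 4] != cur:
            continue                            # empty slot or hash collision
        length = 4
        while pos + length < len(data) and data[cand + length] == data[pos + length]:
            length += 1
        if length > best[1]:
            best = (pos - cand, length)
    banks[pos % NUM_BANKS][key] = pos           # spread match history across the banks
    return best

banks = [dict() for _ in range(NUM_BANKS)]
text = b"abcdabcdabcdXYZ"
for p in range(len(text)):
    off, ln = find_match(text, p, banks)
    if ln:
        print(f"pos {p}: match at offset {off}, length {ln}")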

A Load Balancing Method using Partition Tuning for Pipelined Multi-way Hash Join (다중 해시 조인의 파이프라인 처리에서 분할 조율을 통한 부하 균형 유지 방법)

  • Mun, Jin-Gyu;Jin, Seong-Il;Jo, Seong-Hyeon
    • Journal of KIISE:Databases
    • /
    • v.29 no.3
    • /
    • pp.180-192
    • /
    • 2002
  • We investigate the effect of data skew in join attributes on the performance of a pipelined multi-way hash join method, and propose two new hash join methods for the shared-nothing multiprocessor environment. The first proposed method allocates buckets statically in round-robin fashion, and the second allocates buckets dynamically based on a frequency distribution (both policies are sketched below). Using hash-based joins, multiple joins can be pipelined so that the early results from a join, before the whole join is completed, are sent to the next join without being staged on disk. The shared-nothing multiprocessor architecture is known to be more scalable for supporting very large databases. However, this hardware structure is very sensitive to data skew. Unless the pipelined execution of multiple hash joins includes some dynamic load balancing mechanism, the skew effect can severely degrade system performance. In this paper, we derive an execution model of the pipeline segment and a cost model, and develop a simulator for the study. As shown by our simulation over a wide range of parameters, join selectivities, and relation sizes, system performance deteriorates as the degree of data skew grows. But the proposed method, using a large number of buckets and a tuning technique, offers substantial robustness over a wide range of skew conditions.
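The two allocation policies can be contrasted with a small sketch. This is not the paper's simulator; a greedy largest-first placement stands in for the frequency-guided dynamic allocation, and the node and bucket counts are made up for illustration.

from collections import Counter

def allocate_round_robin(buckets, num_nodes):
    """Assign bucket i to node i mod num_nodes, ignoring bucket sizes."""
    return {b: i % num_nodes for i, b in enumerate(buckets)}

def allocate_by_frequency(bucket_sizes, num_nodes):
    """Greedy largest-first assignment: place the biggest buckets first on the
    currently least-loaded node, so skewed buckets spread more evenly."""
    load = [0] * num_nodes
    assignment = {}
    for b, size in sorted(bucket_sizes.items(), key=lambda kv: -kv[1]):
        node = load.index(min(load))
        assignment[b] = node
        load[node] += size
    return assignment, load

# Skewed join-attribute distribution: bucket 0 is a heavy hitter.
bucket_sizes = Counter({0: 900, 1: 50, 2: 40, 3: 30, 4: 20, 5: 10})
rr = allocate_round_robin(list(bucket_sizes), 3)
rr_load = [sum(s for b, s in bucket_sizes.items() if rr[b] == n) for n in range(3)]
freq, freq_load = allocate_by_frequency(bucket_sizes, 3)
print("round-robin loads:", rr_load)   # [930, 70, 50] -> badly skewed
print("frequency loads:  ", freq_load) # [900, 80, 70] -> still limited by the hot bucket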

8-bit 10-MHz A/D Converter for Video Signal Processing (영상 신호 처리용 8-bit 10-MHz A/D 변환기)

  • Park Chang-Sun;Son Ju-Ho;Lee Jun-Ho;Kim Chong-Min;Kim Dong-Yong
    • Proceedings of the Acoustical Society of Korea Conference
    • /
    • autumn
    • /
    • pp.173-176
    • /
    • 1999
  • In this work, an A/D converter is implemented to obtain 8-bit resolution at a conversion rate of 10 Msample/s for video applications. The proposed architecture is a low-power pipelined A/D converter built from flash A/D converter stages (an idealized two-stage model is sketched below). The architecture consists of two identical stages, each comprising a sample/hold circuit, a low-power comparator, a voltage reference circuit, and an MDAC with a binary-weighted capacitor array. The proposed A/D converter is designed in a 0.25 µm CMOS technology. The SNR is 76.3 dB at a sampling rate of 10 MHz with a 3.9 MHz sine input signal. When the 8-bit 10 Msample/s A/D converter is simulated, the Differential Nonlinearity / Integral Nonlinearity (DNL/INL) errors are ±0.5/±2 LSB, respectively. The power consumption is 13 mW at 10 Msample/s.

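The pipelined conversion principle can be illustrated with an idealized numeric model: each stage flash-quantizes its input, and the MDAC forms an amplified residue for the next stage. The 4-bit-per-stage split and the ideal component behaviour are assumptions; the paper's circuit-level details are not modeled.

def pipeline_stage(vin, vref, bits):
    """Flash-quantize vin to 'bits' bits, then form the amplified residue that the
    next stage would sample: residue = (vin - code*LSB) * 2**bits."""
    levels = 2 ** bits
    lsb = vref / levels
    code = min(int(vin / lsb), levels - 1)
    residue = (vin - code * lsb) * levels
    return code, residue

def convert(vin, vref=1.0):
    """8-bit conversion as two cascaded 4-bit stages."""
    msb, residue = pipeline_stage(vin, vref, 4)
    lsb_code, _ = pipeline_stage(residue, vref, 4)
    return (msb << 4) | lsb_code

vref = 1.0
for v in (0.0, 0.25, 0.5, 0.999):
    print(f"{v:5.3f} V -> code {convert(v, vref)} ({convert(v, vref) / 256 * vref:.4f} V)")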

Architecture of Multiple-Queue Manager for Input-Queued Switch Tolerating Arbitration Latency (중재 지연 내성을 가지는 입력 큐 스위치의 다중 큐 관리기 구조)

  • 정갑중;이범철
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.26 no.12C
    • /
    • pp.261-267
    • /
    • 2001
  • This paper presents the architecture of a multiple-queue manager for an input-queued switch that has arbitration latency, and the design of the chip. The proposed multiple-queue manager provides wire-speed routing with pipelined buffer management, and tolerates the request/grant transmission latency between the input queue manager and the central arbiter by means of a new request control method based on a high-speed shifter. The multiple-input-queue manager has been implemented in a field programmable gate array chip supporting OC-48c port speed. It raises the maximum throughput of the input-queued switch to 98.6% with a 128-cell shared input buffer at a 16×16 switch size.


A Design of A Multistandard Digital Video Encoder using a Pipelined Architecture

  • Oh, Seung-Ho;Park, Han-Jun;Kwon, Sung-Woo;Lee, Moon-Key
    • Journal of Electrical Engineering and Information Science
    • /
    • v.2 no.5
    • /
    • pp.9-16
    • /
    • 1997
  • This paper describes the design of a multistandard video encoder. The proposed encoder accepts conventional NTSC/PAL video signals, and it also processes the PAL-plus video signal that is now popular in Europe. The encoder consists of five major building blocks: a letter-box converter, a color space converter, digital filters, a color modulator, and a timing generator. In order to support multistandard video signals, a programmable systolic architecture is adopted in designing the various digital filters. Interpolation filters are also used to enhance the signal-to-noise ratio of the encoded video signals. The input to the encoder can be either a YCbCr signal or an RGB signal (the standard RGB-to-YCbCr conversion is sketched below), and the outputs are luminance (Y), chrominance (C), and composite video baseband (Y+C) signals. The architecture of the encoder is defined using a Matlab program and modelled in Verilog-HDL. The overall operation is verified using various video signals, such as color bar patterns, ramp signals, and so on. The encoder contains 42K gates and is implemented in a 0.6 µm CMOS process.

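Among the building blocks listed above, the color space converter follows well-known equations; a plain-Python sketch of the standard ITU-R BT.601 RGB-to-YCbCr mapping is shown below. The encoder's fixed-point scaling and filter pipeline are not modeled here.

def clamp(x: float) -> int:
    return max(0, min(255, round(x)))

def rgb_to_ycbcr(r, g, b):
    """r, g, b in 0..255; returns (Y, Cb, Cr) using the BT.601 full-range equations."""
    y  =  0.299    * r + 0.587    * g + 0.114    * b
    cb = -0.168736 * r - 0.331264 * g + 0.5      * b + 128
    cr =  0.5      * r - 0.418688 * g - 0.081312 * b + 128
    return clamp(y), clamp(cb), clamp(cr)

print(rgb_to_ycbcr(255, 255, 255))   # white -> (255, 128, 128)
print(rgb_to_ycbcr(255, 0, 0))       # red   -> (76, 85, 255)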

Deblocking Filter Based on Edge-Preserving Algorithm And an Efficient VLSI Architecture (경계선 보존 알고리즘 기반의 디블로킹 필터와 효율적인 VLSI 구조)

  • Vinh, Truong Quang;Kim, Ji-Hoon;Kim, Young-Chul
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.36 no.11C
    • /
    • pp.662-672
    • /
    • 2011
  • This paper presents a new edge-preserving algorithm and its VLSI architecture for block artifact reduction. Unlike previous approaches using block classification, our algorithm utilizes pixel classification to categorize each pixel into one of two classes, namely smooth region and edge region, which are described by the edge-preserving maps (a minimal classification sketch follows below). Based on these maps, a two-step adaptive filter, which includes offset filtering and edge-preserving filtering, is used to remove block artifacts. A pipelined VLSI architecture of the proposed deblocking algorithm for HD video processing is also presented in this paper. A memory-reduced architecture for the block buffer is used to optimize memory usage. The proposed deblocking filter is prototyped on an FPGA (Cyclone II), and its performance is estimated after synthesis with the ANAM 0.25 µm CMOS cell library using Synopsys Design Compiler. Our experimental results show that the proposed algorithm effectively reduces block artifacts while preserving details.
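A minimal sketch of the pixel-classification step, assuming a simple local-gradient threshold as the smooth/edge criterion; the paper's exact classification rule and the two-step adaptive filter are not reproduced.

def classify_pixels(img, threshold=20):
    """img: 2-D list of luma values. Returns a same-sized map with
    'E' where the local horizontal/vertical gradient exceeds threshold, else 'S'."""
    h, w = len(img), len(img[0])
    label = [['S'] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = abs(img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)])
            gy = abs(img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x])
            if max(gx, gy) > threshold:
                label[y][x] = 'E'   # edge region: deblocking should be weak here
    return label

# 4x4 block with a sharp vertical edge between columns 1 and 2.
block = [[16, 16, 200, 200]] * 4
for row in classify_pixels(block):
    print(''.join(row))   # prints 'SEES' on every row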

A hardware architecture based on the NCC algorithm for fast disparity estimation in 3D shape measurement systems (고밀도 3D 형상 계측 시스템에서의 고속 시차 추정을 위한 NCC 알고리즘 기반 하드웨어 구조)

  • Bae, Kyeong-Ryeol;Kwon, Soon;Lee, Yong-Hwan;Lee, Jong-Hun;Moon, Byung-In
    • Journal of Sensor Science and Technology
    • /
    • v.19 no.2
    • /
    • pp.99-111
    • /
    • 2010
  • This paper proposes an efficient hardware architecture to estimate disparities between 2D images for generating 3D depth images in a stereo vision system. Stereo matching methods are classified into global and local methods. The local matching method uses cost functions based on pixel windows, such as SAD (sum of absolute differences), SSD (sum of squared differences), and NCC (normalized cross-correlation). The NCC-based cost function is less susceptible to differences in noise and lighting conditions between the left and right images than subtraction-based functions such as SAD and SSD, and for this reason NCC is preferred over the other functions (a compact NCC matching sketch follows below). However, software-based implementations are not adequate for NCC-based real-time stereo matching due to its numerous complex operations. Therefore, we propose a fast pipelined hardware architecture suitable for real-time operation of the NCC function. By adopting a block-based box-filtering scheme to perform NCC operations in parallel, the proposed architecture improves processing speed compared with previous research. In this architecture, it takes almost the same number of cycles to process all the pixels irrespective of the window size. Also, the simulation results show that its disparity estimation has a low error rate.
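For reference, the window-based NCC cost can be written compactly as below. This sketch uses the non-zero-mean NCC variant and recomputes window sums directly; the proposed hardware instead reuses sums via a block-based box filter, which is what makes the cycle count independent of window size.

import math

def ncc(left, right, x, y, d, w=1):
    """NCC between the (2w+1)x(2w+1) window around (x, y) in 'left' and the window
    around (x - d, y) in 'right'. Returns a value in [0, 1] for non-negative images."""
    s_lr = s_ll = s_rr = 0.0
    for dy in range(-w, w + 1):
        for dx in range(-w, w + 1):
            l = left[y + dy][x + dx]
            r = right[y + dy][x - d + dx]
            s_lr += l * r
            s_ll += l * l
            s_rr += r * r
    return s_lr / math.sqrt(s_ll * s_rr) if s_ll and s_rr else 0.0

def best_disparity(left, right, x, y, max_d=4, w=1):
    """Pick the disparity whose window has the highest NCC score."""
    return max(range(max_d + 1), key=lambda d: ncc(left, right, x, y, d, w))

# Toy images: the right view is the left view shifted by 2 pixels.
left  = [[0, 0, 10, 50, 90, 10, 0, 0, 0, 0]] * 5
right = [[10, 50, 90, 10, 0, 0, 0, 0, 0, 0]] * 5
print(best_disparity(left, right, x=5, y=2))   # expected: 2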

Low Area Hardware Design of Efficient SAO for HEVC Encoder (HEVC 부호기를 위한 효율적인 SAO의 저면적 하드웨어 설계)

  • Cho, Hyunpyo;Ryoo, Kwangki
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.19 no.1
    • /
    • pp.169-177
    • /
    • 2015
  • This paper proposes a low-area hardware architecture for an efficient SAO (Sample Adaptive Offset) in an HEVC (High Efficiency Video Coding) encoder. SAO is a technique newly adopted in HEVC as part of the in-loop filter; it reduces mean sample distortion by adding offsets to reconstructed samples (the edge-offset idea is sketched below). The existing SAO requires a great deal of computation and processing time for UHD (Ultra High Definition) video due to its sample-by-sample processing. To reduce SAO processing time, the proposed SAO hardware architecture processes four samples simultaneously and is implemented as a 2-step pipelined architecture. In addition, to reduce hardware area, it uses a single architecture for both luma and chroma components and shares optimized, common operators. The proposed SAO hardware architecture is designed in Verilog HDL (Hardware Description Language) and occupies a total of 190k gates in a TSMC 0.13 µm CMOS standard cell library. At 200 MHz, it can support 4K UHD video encoding at 60 fps in real time, and it can operate at up to 250 MHz.
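The offset mechanism itself is small; the sketch below applies HEVC-style edge-offset classification along one direction. The offset values are invented for illustration; a real encoder derives them per CTB by rate-distortion search.

def sao_edge_offset(samples, offsets):
    """samples: 1-D list of reconstructed values along the filtering direction.
    offsets: dict mapping edge category 1..4 to a signed offset; category 0 is untouched."""
    out = list(samples)
    for i in range(1, len(samples) - 1):
        left, cur, right = samples[i - 1], samples[i], samples[i + 1]
        if cur < left and cur < right:
            cat = 1                       # local minimum
        elif (cur < left and cur == right) or (cur == left and cur < right):
            cat = 2                       # concave corner
        elif (cur > left and cur == right) or (cur == left and cur > right):
            cat = 3                       # convex corner
        elif cur > left and cur > right:
            cat = 4                       # local maximum
        else:
            cat = 0                       # monotonic or flat: no offset applied
        out[i] = cur + offsets.get(cat, 0)
    return out

row = [10, 10, 7, 10, 10, 14, 10, 10]
print(sao_edge_offset(row, {1: 2, 2: 1, 3: -1, 4: -2}))
# -> [10, 9, 9, 9, 11, 12, 11, 10]: dips are lifted, peaks are pulled down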

Design of an Area-Efficient Architecture for Block-wise MAP Turbo Decoder (면적 효율적인 구조의 블록 MAP 터보 복호기 설계)

  • Kang, Moon-Jun;Kim, Sik;Hwang, Sun-Young
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.27 no.8A
    • /
    • pp.725-732
    • /
    • 2002
  • The block-wise MAP (Maximum A Posteriori) decoding algorithm for turbo codes requires less memory than the Log-MAP decoding algorithm. The BER (Bit Error Rate) performance of the previous block-wise MAP decoding algorithm depends on the block length and the training length. To maximize hardware utilization and allow successive decoding, the block length is set equal to the training length in previous MAP decoding algorithms. Simulation results on BER performance show that the BER performance can be maintained with shorter blocks when the training length is sufficient. This paper proposes an area-efficient architecture for a block-wise MAP decoder. The proposed architecture employs a decoding scheme that reduces memory by using a training length N times larger than the block length (the windowing is sketched below). To handle the proposed scheme efficiently, a pipelined architecture is proposed. Simulation results show that memory usage can be reduced by 30%~45% in the proposed architecture without degrading the BER performance.
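The memory saving comes from windowing rather than from the decoder math itself; the sketch below only shows how a frame is cut into decode blocks with training segments, assuming a training length three times the block length as one possible setting.

def block_windows(frame_len, block_len, train_len):
    """Yield (decode_start, decode_end, train_end) index triples for each block:
    the recursion is warmed up over [decode_end, train_end) before its metrics are trusted."""
    for start in range(0, frame_len, block_len):
        end = min(start + block_len, frame_len)
        train_end = min(end + train_len, frame_len)
        yield start, end, train_end

frame_len, block_len, train_len = 1024, 32, 96   # training length = 3 x block length
windows = list(block_windows(frame_len, block_len, train_len))
print(windows[:3])        # [(0, 32, 128), (32, 64, 160), (64, 96, 192)]
print(len(windows))       # 32 blocks

# Rough memory comparison: full MAP stores state metrics for the whole frame,
# block-wise MAP only for one block (training-segment metrics are discarded as computed).
print("metric storage ratio:", block_len / frame_len)   # 0.03125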