• Title/Summary/Keyword: Parallel pipeline

Search Result 172, Processing Time 0.023 seconds

Hardware Design of Efficient SAO for High Performance In-loop filters (고성능 루프내 필터를 위한 효율적인 SAO 하드웨어 설계)

  • Park, Seungyong;Ryoo, Kwangki
    • Proceedings of the Korean Institute of Information and Commucation Sciences Conference
    • /
    • 2017.10a
    • /
    • pp.543-545
    • /
    • 2017
  • This paper describes the SAO hardware architecture design for high performance in-loop filters. SAO is an inner module of in-loop filter, which compensates for information loss caused by block-based image compression and quantization. However, HEVC's SAO requires a high computation time because it performs pixel-unit operations. Therefore, the SAO hardware architecture proposed in this paper is based on a $4{\times}4$ block operation and a 2-stage pipeline structure for high-speed operation. The information generation and offset computation structure for SAO computation is designed in a parallel structure to minimize computation time. The proposed hardware architecture was designed with Verilog HDL and synthesized with TSMC chip process 130nm and 65nm cell library. The proposed hardware design achieved a maximum frequency of 476MHz yielding 163k gates and 312.5MHz yielding 193.6k gates on the 130nm and 65nm processes respectively.

  • PDF

Parallel Distributed Implementation of GHT on Ethernet Multicluster (이더넷 다중 클러스터에서 GHT의 병렬 분산 구현)

  • Kim, Yeong-Soo;Kim, Myung-Ho;Choi, Heung-Moon
    • Journal of the Institute of Electronics Engineers of Korea CI
    • /
    • v.46 no.3
    • /
    • pp.96-106
    • /
    • 2009
  • Extending the scale of the distributed processing in a single Ethernet cluster is physically restricted by maximum ports per switch. This paper presents an implementation of MPI-based multicluster consisting of multiple Ethernet switches for extending the scale of distributed processing, and a asymptotical analysis for communication overhead through execution-time analysis model. To determine an optimum task partitioning, we analyzed the processing time for various partitioning schemes, and AAP(accumulator array partitioning) scheme was finally chosen to minimize the overall communication overhead. The scope of data partitioned in AAP was modified to fit for incremented nodes, and suitable load balancing algorithm was implemented. We tried to alleviate the communication overhead through exploiting the pipelined broadcast and flat-tree based result gathering, and overlapping of the communication and the computation time. We used the linear pipeline broadcast to reduce the communication overhead in intercluster which is interconnected by a single link. Experimental results shows nearly linear speedup by the proposed parallel distributed GHT implemented on MPI-based Ethernet multicluster with four 100Mbps Ethernet switches and up to 128 nodes of Pentium PC.

Optimized Hardware Design using Sobel and Median Filters for Lane Detection

  • Lee, Chang-Yong;Kim, Young-Hyung;Lee, Yong-Hwan
    • Journal of Advanced Information Technology and Convergence
    • /
    • v.9 no.1
    • /
    • pp.115-125
    • /
    • 2019
  • In this paper, the image is received from the camera and the lane is sensed. There are various ways to detect lanes. Generally, the method of detecting edges uses a lot of the Sobel edge detection and the Canny edge detection. The minimum use of multiplication and division is used when designing for the hardware configuration. The images are tested using a black box image mounted on the vehicle. Because the top of the image of the used the black box is mostly background, the calculation process is excluded. Also, to speed up, YCbCr is calculated from the image and only the data for the desired color, white and yellow lane, is obtained to detect the lane. The median filter is used to remove noise from images. Intermediate filters excel at noise rejection, but they generally take a long time to compare all values. In this paper, by using addition, the time can be shortened by obtaining and using the result value of the median filter. In case of the Sobel edge detection, the speed is faster and noise sensitive compared to the Canny edge detection. These shortcomings are constructed using complementary algorithms. It also organizes and processes data into parallel processing pipelines. To reduce the size of memory, the system does not use memory to store all data at each step, but stores it using four line buffers. Three line buffers perform mask operations, and one line buffer stores new data at the same time as the operation. Through this work, memory can use six times faster the processing speed and about 33% greater quantity than other methods presented in this paper. The target operating frequency is designed so that the system operates at 50MHz. It is possible to use 2157fps for the images of 640by360 size based on the target operating frequency, 540fps for the HD images and 240fps for the Full HD images, which can be used for most images with 30fps as well as 60fps for the images with 60fps. The maximum operating frequency can be used for larger amounts of the frame processing.

A Pipelined Hash Join Method for Load Balancing (부하 균형 유지를 고려한 파이프라인 해시 조인 방법)

  • Moon, Jin-Gue;Park, No-Sang;Kim, Pyeong-Jung;Jin, Seong-Il
    • The KIPS Transactions:PartD
    • /
    • v.9D no.5
    • /
    • pp.755-768
    • /
    • 2002
  • We investigate the effect of the data skew of join attributes on the performance of a pipelined multi-way hash join method, and propose two new hash join methods with load balancing capabilities. The first proposed method allocates buckets statically by round-robin fashion, and the second one allocates buckets adaptively via a frequency distribution. Using hash-based joins, multiple joins can be pipelined so that the early results from a join, before the whole join is completed, are sent to the next join processing without staying on disks. Unless the pipelining execution of multiple hash joins includes some load balancing mechanisms, the skew effect can severely deteriorate system performance. In this paper, we derive an execution model of the pipeline segment and a cost model, and develop a simulator for the study. As shown by our simulation with a wide range of parameters, join selectivities and sizes of relations deteriorate the system performance as the degree of data skew is larger. But the proposed method using a large number of buckets and a tuning technique can offer substantial robustness against a wide range of skew conditions.

A High-speed Packet Filtering System Architecture in Signature-based Network Intrusion Prevention (시그내쳐 기반의 네트워크 침입 방지에서 고속의 패킷 필터링을 위한 시스템 구조)

  • Kim, Dae-Young;Kim, Sun-Il;Lee, Jun-Yong
    • Journal of KIISE:Computer Systems and Theory
    • /
    • v.34 no.2
    • /
    • pp.73-83
    • /
    • 2007
  • In network intrusion prevention, attack packets are detected and filtered out based on their attack signatures. Pattern matching is extensively used to find attack signatures and the most time-consuming execution part of Network Intrusion Prevention Systems(NIPS). Pattern matching is usually accelerated by hardware and should be performed at wire speed in NIPS. However, that alone is not good enough. First, pattern matching hardware should be able to generate sufficient pattern match information including the pattern index number and the location of the match found at wire speed. Second, it should support pattern grouping to reduce unnecessary pattern matches. Third, it should always have a constant worst-case performance even if the number of patterns is increased. Finally it should be able to update patterns in a few minutes or seconds without stopping its operations, We propose a system architecture to meet the above requirement. The system architecture can process multiple pattern characters in parallel and employs a pipeline architecture to achieve high speed. Using Xilinx FPGA simulation, we show that the new system stales well to achieve a high speed oner 10Gbps and satisfies all of the above requirements.

Optimized Hardware Design of Deblocking Filter for H.264/AVC (H.264/AVC를 위한 디블록킹 필터의 최적화된 하드웨어 설계)

  • Jung, Youn-Jin;Ryoo, Kwang-Ki
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.47 no.1
    • /
    • pp.20-27
    • /
    • 2010
  • This paper describes a design of 5-stage pipelined de-blocking filter with power reduction scheme and proposes a efficient memory architecture and filter order for high performance H.264/AVC Decoder. Generally the de-blocking filter removes block boundary artifacts and enhances image quality. Nevertheless filter has a few disadvantage that it requires a number of memory access and iterated operations because of filter operation for 4 time to one edge. So this paper proposes a optimized filter ordering and efficient hardware architecture for the reduction of memory access and total filter cycles. In proposed filter parallel processing is available because of structured 5-stage pipeline consisted of memory read, threshold decider, pre-calculation, filter operation and write back. Also it can reduce power consumption because it uses a clock gating scheme which disable unnecessary clock switching. Besides total number of filtering cycle is decreased by new filter order. The proposed filter is designed with Verilog-HDL and functionally verified with the whole H.264/AVC decoder using the Modelsim 6.2g simulator. Input vectors are QCIF images generated by JM9.4 standard encoder software. As a result of experiment, it shows that the filter can make about 20% total filter cycles reduction and it requires small transposition buffer size.

Floating Point Unit Design for the IEEE754-2008 (IEEE754-2008을 위한 고속 부동소수점 연산기 설계)

  • Hwang, Jin-Ha;Kim, Hyun-Pil;Park, Sang-Su;Lee, Yong-Surk
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.48 no.10
    • /
    • pp.82-90
    • /
    • 2011
  • Because of the development of Smart phone devices, the demands of high performance FPU(Floating-point Unit) becomes increasing. Therefore, we propose the high-speed single-/double-precision FPU design that includes an elementary add/sub unit and improved multiplier and compare and convert units. The most commonly used add/sub unit is optimized by the parallel rounding unit. The matrix operation is used in complex calculation something like a graphic calculation. We designed the Multiply-Add Fused(MAF) instead of multiplier to calculate the matrix more quickly. The branch instruction that is decided by the compare operation is very frequently used in various programs. We bypassed the result of the compare operation before all the pipeline processes ended to decrease the total execution time. And we included additional convert operations that are added in IEEE754-2008 standard. To verify our RTL designs, we chose four hundred thousand test vectors by weighted random method and simulated each unit. The FPU that was synthesized by Samsung's 45-nm low-power process satisfied the 600-MHz operation frequency. And we confirm a reduction in area by comparing the improved FPU with the existing FPU.

A Load Balancing Method using Partition Tuning for Pipelined Multi-way Hash Join (다중 해시 조인의 파이프라인 처리에서 분할 조율을 통한 부하 균형 유지 방법)

  • Mun, Jin-Gyu;Jin, Seong-Il;Jo, Seong-Hyeon
    • Journal of KIISE:Databases
    • /
    • v.29 no.3
    • /
    • pp.180-192
    • /
    • 2002
  • We investigate the effect of the data skew of join attributes on the performance of a pipelined multi-way hash join method, and propose two new harsh join methods in the shared-nothing multiprocessor environment. The first proposed method allocates buckets statically by round-robin fashion, and the second one allocates buckets dynamically via a frequency distribution. Using harsh-based joins, multiple joins can be pipelined to that the early results from a join, before the whole join is completed, are sent to the next join processing without staying in disks. Shared nothing multiprocessor architecture is known to be more scalable to support very large databases. However, this hardware structure is very sensitive to the data skew. Unless the pipelining execution of multiple hash joins includes some dynamic load balancing mechanism, the skew effect can severely deteriorate the system performance. In this parer, we derive an execution model of the pipeline segment and a cost model, and develop a simulator for the study. As shown by our simulation with a wide range of parameters, join selectivities and sizes of relations deteriorate the system performance as the degree of data skew is larger. But the proposed method using a large number of buckets and a tuning technique can offer substantial robustness against a wide range of skew conditions.

Hardware Design of High Performance HEVC Deblocking Filter for UHD Videos (UHD 영상을 위한 고성능 HEVC 디블록킹 필터 설계)

  • Park, Jaeha;Ryoo, Kwangki
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.19 no.1
    • /
    • pp.178-184
    • /
    • 2015
  • This paper proposes a hardware architecture for high performance Deblocking filter(DBF) in High Efficiency Video Coding for UHD(Ultra High Definition) videos. This proposed hardware architecture which has less processing time has a 4-stage pipelined architecture with two filters and parallel boundary strength module. Also, the proposed filter can be used in low-voltage design by using clock gating architecture in 4-stage pipeline. The segmented memory architecture solves the hazard issue that arises when single port SRAM is accessed. The proposed order of filtering shortens the delay time that arises when storing data into the single port SRAM at the pre-processing stage. The DBF hardware proposed in this paper was designed with Verilog HDL, and was implemented with 22k logic gates as a result of synthesis using TSMC 0.18um CMOS standard cell library. Furthermore, the dynamic frequency can process UHD 8k($7680{\times}4320$) samples@60fps using a frequency of 150MHz with an 8K resolution and maximum dynamic frequency is 285MHz. Result from analysis shows that the proposed DBF hardware architecture operation cycle for one process coding unit has improved by 32% over the previous one.

A Study on Motion Estimation Encoder Supporting Variable Block Size for H.264/AVC (H.264/AVC용 가변 블록 크기를 지원하는 움직임 추정 부호기의 연구)

  • Kim, Won-Sam;Sohn, Seung-Il
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.12 no.10
    • /
    • pp.1845-1852
    • /
    • 2008
  • The key elements of inter prediction are motion estimation(ME) and motion compensation(MC). Motion estimation is to find the optimum motion vectors, not only by using a distance criteria like the SAD, but also by taking into account the resulting number of 비트s in the 비트 stream. Motion compensation is compensate for movement of blocks of current frame. Inter-prediction Encoding is always the main bottleneck in high-quality streaming applications. Therefore, in real-time streaming applications, dedicated hardware for executing Inter-prediction is required. In this paper, we studied a motion estimator(ME) for H.264/AVC. The designed motion estimator is based on 2-D systolic array and it connects processing elements for fast SAD(Sum of Absolute Difference) calculation in parallel. By providing different path for the upper and lower lesion of each reference data and adjusting the input sequence, consecutive calculation for motion estimation is executed without pipeline stall. With data reuse technique, it reduces memory access, and there is no extra delay for finding optimal partitions and motion vectors. The motion estimator supports variable-block size and takes 328 cycles for macro-block calculation. The proposed architecture is local memory-free different from paper [6] using local memory. This motion estimation encoder can be applicable to real-time video processing.