• Title/Summary/Keyword: Parallel Implementation

Search Result 883, Processing Time 0.02 seconds

Implementation of IQ/IDCT in H.264/AVC Decoder Using GPGPU (GPGPU를 이용한 H.264/AVC 디코더)

  • Kim, Dong-Han;Lee, Kwang-Yeob
    • Proceedings of the Korean Institute of Information and Commucation Sciences Conference
    • /
    • 2010.05a
    • /
    • pp.162-164
    • /
    • 2010
  • H.264/AVC(Advanced Video Coding) is a standard for video compression. H.264/AVC provides good video quality at substantially lower bit rates than previous standards. In this papers, we propose the efficient architecture of H.264/AVC decoder using GPGPU. GPGPU can process many of operation in parallel. IQ/IDCT is possible that parallel processing in H.264/AVC decoding algorithm.

  • PDF

An Efficient Interpolation Hardware Architecture for HEVC Inter-Prediction Decoding

  • Jin, Xianzhe;Ryoo, Kwangki
    • Journal of information and communication convergence engineering
    • /
    • v.11 no.2
    • /
    • pp.118-123
    • /
    • 2013
  • This paper proposes an efficient hardware architecture for high efficiency video coding (HEVC), which is the next generation video compression standard. It adopts several new coding techniques to reduce the bit rate by about 50% compared with the previous one. Unlike the previous H.264/AVC 6-tap interpolation filter, in HEVC, a one-dimensional seven-tap and eight-tap filter is adopted for luma interpolation, but it also increases the complexity and gate area in hardware implementation. In this paper, we propose a parallel architecture to boost the interpolation performance, achieving a luma $4{\times}4$ block interpolation in 2-4 cycles. The proposed architecture contains shared operations reducing the gate count increased due to the parallel architecture. This makes the area efficiency better than the previous design, in the best case, with the performance improved by about 75.15%. It is synthesized with the MagnaChip $0.18{\mu}m$ library and can reach the maximum frequency of 200 MHz.

Design and Implementation of a Massively Parallel Multithreaded Architecture: DAVRID

  • Sangho Ha;Kim, Junghwan;Park, Eunha;Yoonhee Hah;Sangyong Han;Daejoon Hwang;Kim, Heunghwan;Seungho Cho
    • Journal of Electrical Engineering and information Science
    • /
    • v.1 no.2
    • /
    • pp.15-26
    • /
    • 1996
  • MPAs(Massively Parallel Architectures) should address two fundamental issues for scalability: synchronization and communication latency. Dataflow architecture faces problems of excessive synchronization overhead and inefficient execution of sequential programs while they offer the ability to exploit massive parallelism inherent in programs. In contrast, MPAs based on von Neumann computational model may suffer from inefficient synchronization mechanism and communication latency. DAVRID (DAtaflow/Von Neumann RISC hybrID) is a massively parallel multithreaded architecture which takes advantages of von Neumann and dataflow models. It has good single thread performance as well as tolerates synchronization and communication latency. In this paper, we describe the DAVRID architecture in detail and evaluate its performance through simulation runs over several benchmarks.

  • PDF

FPGA Design of a Parallel Canny Edge Detector with Optimized Local Buffers (로컬 버퍼 최적화를 통한 병렬 처리 캐니 경계선 검출기의 FPGA 설계)

  • Ingi Min;Suhyun Sim;Seungwon Hwang;Sunhee Kim
    • Journal of the Semiconductor & Display Technology
    • /
    • v.22 no.4
    • /
    • pp.59-65
    • /
    • 2023
  • Edge detection in image processing and computer vision is one of the most fundamental operations. Canny edge detection algorithm has excellent performance and is currently widely used. However, it is difficult to process the algorithm in real-time because the algorithm is complex. In this study, the equations required in the algorithm were simplified to facilitate hardware implementation, and the calculation speed was increased by using a parallel structure. In particular, the size and management of local buffers were selected in consideration of parallel processing and filter size so that data could be processed without bottlenecks. It was designed in verilog and implemented in FPGA to verify operation and performance.

  • PDF

Design and Implementation of Visual Environment for Parallel Object-Oriented Programming (병렬 객체지향 프로그래밍을 위한 시각 환경의 설계 및 구현)

  • Choe, Suk-Yeong
    • The Transactions of the Korea Information Processing Society
    • /
    • v.6 no.2
    • /
    • pp.485-496
    • /
    • 1999
  • Comparing with sequential programming, parallel programming has additional complexity due to the consideration of parallelism, communication and synchronization of processes. A synergism between users and compliers should be established, each assisting the other to produce high quality parallel programs. On the above underlying philosophy, we developed a parallel Object-Oriented specification language, POOSL, as preliminary works. However, it is still likely to hard for users to write parallel program because users have to consider grammar of POOSL and to write text-based parallel program. It would be more desirable to provide users wit visual environment for effective parallel programming. Therefore, we propose a visual programming environment. VEPO(Visual environment for Parallel Object-Oriented Programming), based on POOSL in order that users can develop parallel programs more easily and conveniently. It aims at supporting a programming environment in which users can represent their programs more naturally and visually I parallel manner with object-oriented concept and essential steps during parallel program development such as program specification, compilation, execution and animation of execution are integrated. VEPO has useful features for parallel processing. Especially, complicated parallel codes for synchronization and communication of processes are automatically generated in the translation phase, so users can be relieved of writing error-prone parallel codes. The system is targeted to the transputer-based parallel system, MC-3. The graphic user interface of VEPO was implemented using Visual C++. Visual programs descirbed on VEPO are translated into Inmos C and executed on MC-3.

  • PDF

Parallel Implementation of Scrypt: A Study on GPU Acceleration for Password-Based Key Derivation Function

  • SeongJun Choi;DongCheon Kim;Seog Chung Seo
    • Journal of information and communication convergence engineering
    • /
    • v.22 no.2
    • /
    • pp.98-108
    • /
    • 2024
  • Scrypt is a password-based key derivation function proposed by Colin Percival in 2009 that has a memory-hard structure. Scrypt has been intentionally designed with a memory-intensive structure to make password cracking using ASICs, GPUs, and similar hardware more difficult. However, in this study, we thoroughly analyzed the operation of Scrypt and proposed strategies to maximize computational parallelism in GPU environments. Through these optimizations, we achieved an outstanding performance improvement of 8284.4% compared with traditional CPU-based Scrypt computations. Moreover, the GPU-optimized implementation presented in this paper outperforms the simple GPU-based Scrypt processing by a significant margin, providing a performance improvement of 204.84% in the RTX3090. These results demonstrate the effectiveness of our proposed approach in harnessing the computational power of GPUs and achieving remarkable performance gains in Scrypt calculations. Our proposed implementation is the first GPU implementation of Scrypt, demonstrating the ability to efficiently crack Scrypt.

GPU-ACCELERATED SPECKLE MASKING RECONSTRUCTION ALGORITHM FOR HIGH-RESOLUTION SOLAR IMAGES

  • Zheng, Yanfang;Li, Xuebao;Tian, Huifeng;Zhang, Qiliang;Su, Chong;Shi, Lingyi;Zhou, Ta
    • Journal of The Korean Astronomical Society
    • /
    • v.51 no.3
    • /
    • pp.65-71
    • /
    • 2018
  • The near real-time speckle masking reconstruction technique has been developed to accelerate the processing of solar images to achieve high resolutions for ground-based solar telescopes. However, the reconstruction of solar subimages in such a speckle reconstruction is very time-consuming. We design and implement a new parallel speckle masking reconstruction algorithm based on the Compute Unified Device Architecture (CUDA) on General Purpose Graphics Processing Units (GPGPU). Tests are performed to validate the correctness of our program on NVIDIA GPGPU. Details of several parallel reconstruction steps are presented, and the parallel implementation between various modules shows a significant speed increase compared to the previous serial implementations. In addition, we present a comparison of runtimes across serial programs, the OpenMP-based method, and the new parallel method. The new parallel method shows a clear advantage for large scale data processing, and a speedup of around 9 to 10 is achieved in reconstructing one solar subimage of $256{\times}256pixels$. The speedup performance of the new parallel method exceeds that of OpenMP-based method overall. We conclude that the new parallel method would be of value, and contribute to real-time reconstruction of an entire solar image.

Construction of a CPU Cluster and Implementation of a 3-D Domain Decomposition Parallel FDTD Algorithm (CPU 클러스터 구축 및 3차원 공간분할 병렬 FDTD 알고리즘 구현)

  • Park, Sungmin;Chu, Kwang-Uk;Ju, Saehoon;Park, Yoon-Mi;Kim, Ki-Baek;Jung, Kyung-Young
    • The Journal of Korean Institute of Electromagnetic Engineering and Science
    • /
    • v.25 no.3
    • /
    • pp.357-364
    • /
    • 2014
  • In this work, we construct a CPU cluster to implement a parallel finite-difference time domain(FDTD) algorithm for fast electromagnetic analyses. This parallel FDTD algorithm can reduce the computational time significantly and also analyze electrically larger structures, compared to a single FDTD counterpart. The parallel FDTD algorithm needs communication between neighboring processors, which is performed by the MPI(Message Passing Interface) library and a 3-D domain decomposition is employed to decrease the communication time between neighboring processors. Compared to a single-processor FDTD, the speed up factor of a-CPU-cluster-based parallel FDTD algorithm is investigated for the normal mode and the hypermode and finally analyze an electrically large concrete structure by the developed parallel algorithm.

A Parallel Processing System for Visual Media Applications (시각매체를 위한 병렬처리 시스템)

  • Lee, Hyung;Pakr, Jong-Won
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.27 no.1A
    • /
    • pp.80-88
    • /
    • 2002
  • Visual media(image, graphic, and video) processing poses challenge from several perpectives, specifically from the point of view of real-time implementation and scalability. There have been several approaches to obtain speedups to meet the computing demands in multimedia processing ranging from media processors to special purpose implementations. A variety of parallel processing strategies are adopted in these implementations in order to achieve the required speedups. We have investigated a parallel processing system for improving the processing speed o f visual media related applications. The parallel processing system we proposed is similar to a pipelined memory stystem(MAMS). The multi-access memory system is made up of m memory modules and a memory controller to perform parallel memory access with a variety of combinations of 1${\times}$pq, pq${\times}$1, and p${\times}$q subarray, which improves both cost and complexity of control. Facial recognition, Phong shading, and automatic segmentation of moving object in image sequences are some that have been applied to the parallel processing system and resulted in faithful processing speed. This paper describes the parallel processing systems for the speedup and its utilization to three time-consuming applications.

Implementations of Hypercube Networks based on TCP/IP for PC Clusters (PC 클러스터를 위한 TCP/IP 기반 하이퍼큐브 네트워크 구현)

  • Lee, Hyung-Bong;Hong, Joon-Pyo;Kim, Young-Tae
    • Journal of the Korea Society of Computer and Information
    • /
    • v.13 no.2
    • /
    • pp.221-233
    • /
    • 2008
  • In general, we use a Parallel processing computer manufactured specially for the purpose of parallel processing to do high performance computings. But we can depoly and use a PC cluster composed of several common PCs instead of the very expensive parallel processing computer. A common way to get a PC cluster is to adopt the star topology network connected by a switch hub. But in this paper, we grope efficient implementations of hypercube networks based on TCP/IP to connect 8 PCs directly for more useful parallel processing environment, and make evaluations on functionality and efficiency of them using ping, netperf, MPICH. The two proposed methods of implementation are IP configuration based on link and IP configuration based on node. The results of comparison between them show that there is not obvious difference in performance but the latter is more efficient in simplicity of routing table. For verification of functionality, we compare the parallel processing results of an application in them with the same in a star network based PC cluster. These results also show that the proposed hypercube networks support a perfect parallel processing environment respectively.

  • PDF