• Title/Summary/Keyword: Parallel Implementation

Parallel Approximate String Matching with k-Mismatches for Multiple Fixed-Length Patterns in DNA Sequences on Graphics Processing Units (GPU을 이용한 다중 고정 길이 패턴을 갖는 DNA 시퀀스에 대한 k-Mismatches에 의한 근사적 병열 스트링 매칭)

  • Ho, ThienLuan;Kim, HyunJin;Oh, SeungRohk
    • The Transactions of The Korean Institute of Electrical Engineers / v.66 no.6 / pp.955-961 / 2017
  • In this paper, we propose a parallel approximate string matching algorithm with k-mismatches for multiple fixed-length patterns (PMASM) in DNA sequences. PMASM is developed from parallel single-pattern approximate string matching algorithms to efficiently calculate the Hamming distances for multiple patterns of a fixed length. In the preprocessing phase of PMASM, all target patterns are binary encoded and stored in a look-up memory. With each input character from the input string, the Hamming distances between a substring and all patterns are updated simultaneously based on the binary encoding information in the look-up memory. Moreover, PMASM uses graphics processing units (GPUs) to perform the computations in parallel. This paper presents three PMASM implementation methods on GPUs: thread PMASM, block-thread PMASM, and shared-mem PMASM. The shared-mem PMASM method shows how to make effective use of the GPU's parallel capacity; it also exploits special features of the CUDA (Compute Unified Device Architecture) memory structure to optimize performance. In experiments with DNA sequences, the proposed PMASM on GPU is 385, 77, and 64 times faster than the traditional naive algorithm, the shift-add algorithm, and the single-thread PMASM implementation on CPU, respectively. On the same NVIDIA GPU model, the proposed approach improves performance by up to 44% and 21% compared with the naive and shift-add algorithms, respectively.
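
The look-up idea in this abstract can be illustrated with a minimal CPU-only sketch. This is not the paper's PMASM or its GPU kernels: the table layout, the function names `build_lookup` and `k_mismatch_search`, and the restriction to the alphabet `ACGT` are all assumptions made for illustration. Each table row records, for one pattern position and one input character, which patterns mismatch there, so the Hamming distances to all patterns are advanced together for every character read.

```python
from typing import Dict, List, Tuple


def build_lookup(patterns: List[str], alphabet: str = "ACGT") -> List[Dict[str, List[int]]]:
    """Preprocessing: lookup[j][c][p] is 1 if pattern p differs from c at position j."""
    m = len(patterns[0])
    if any(len(p) != m for p in patterns):
        raise ValueError("all patterns must have the same length")
    return [
        {c: [int(p[j] != c) for p in patterns] for c in alphabet}
        for j in range(m)
    ]


def k_mismatch_search(text: str, patterns: List[str], k: int) -> List[Tuple[int, int, int]]:
    """Return (window start, pattern index, Hamming distance) for every match with <= k mismatches.

    Assumes every character of `text` is in the alphabet used by build_lookup.
    """
    lookup = build_lookup(patterns)
    m = len(patterns[0])
    hits = []
    for i in range(len(text) - m + 1):
        dist = [0] * len(patterns)
        for j in range(m):
            # One table row updates the distances of all patterns at once.
            row = lookup[j][text[i + j]]
            dist = [d + r for d, r in zip(dist, row)]
        hits.extend((i, p, d) for p, d in enumerate(dist) if d <= k)
    return hits
```

For example, `k_mismatch_search("ACGTACGA", ["ACGT", "TCGA"], 1)` reports an exact match of the first pattern at offset 0 and one-mismatch matches of both patterns at offset 4. Each window is independent of the others, which is the property a GPU implementation could exploit by assigning windows to threads.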

Implementation of Parallel Processing Based Pedestrian Detection Using a Modified CENTRIST Algorithm (개선된 CENTRIST 알고리즘을 적용한 병렬처리 기반 보행자 인식 구현)

  • Jung, Jun-Mo
    • Journal of IKEEE / v.18 no.3 / pp.398-402 / 2014
  • In this paper, we propose a parallel processing method for a pedestrian detection algorithm based on ROI-CENTRIST. Conventional pedestrian detection methods are difficult to run in real time in an embedded environment. This problem can be addressed by a parallel processing method that applies a region of interest (ROI) to the conventional algorithm. The proposed parallel pedestrian detection method using ROI-CENTRIST achieves 5.2 frames per second, about a 10% improvement over the conventional CENTRIST-based pedestrian detection method.

Implementation and Performance Analysis of High Performance Computing Library for Parallel Processing (병렬처리를 위한 고성능 라이브러리의 구현과 성능 평가)

  • 김영태;이용권
    • Journal of KIISE:Computer Systems and Theory / v.31 no.7 / pp.379-386 / 2004
  • We designed a portable parallel library, HPCL (High Performance Computing Library), with the following objectives: (1) to keep the parallel code closely related to the original sequential code, which will help with future versions of the sequential code, and (2) to enhance the performance of the parallel code. The library is an interface, written in C and Fortran, between MPI (Message Passing Interface) and parallel programs in Fortran. Performance results were obtained on PC clusters and an IBM SP4.

On Parallel Implementation of Lagrangean Approximation Procedure (Lagrangean 근사과정의 병렬계산)

  • 이호창
    • Journal of the Korean Operations Research and Management Science Society / v.18 no.3 / pp.13-34 / 1993
  • By operating on many parts of a software system concurrently, parallel processing computers may provide several orders of magnitude more computing power than traditional serial computers. If the Lagrangean approximation procedure is applied to a large-scale manufacturing problem that is decomposable into many subproblems, the procedure is a perfect candidate for parallel processing. By distributing the Lagrangean subproblems for given multipliers to multiple processors, running the processors concurrently, and modifying the Lagrangean multipliers at the end of each iteration of a subgradient method, parallel processing of a Lagrangean approximation procedure may provide a significant speedup. The purpose of this research is to investigate the potential of the parallelized Lagrangean approximation procedure (PLAP) for certain combinatorial optimization problems in manufacturing systems. A framework for PLAP is proposed for some combinatorial manufacturing problems that are decomposable into well-structured subproblems. The synchronous PLAP for the multistage dynamic lot-sizing problem is implemented on an Alliant FX/4 parallel computer, and its computational experience is reported as a promising application of vector-concurrent computing.
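
The synchronous loop described in this abstract (solve all subproblems concurrently for fixed multipliers, then update the multipliers once per iteration) can be sketched on a toy problem. This is not the paper's lot-sizing model: the covering problem below, the function names, the single multiplier, and the harmonic step size `1/(t+1)` are assumptions chosen to keep the example small. The toy problem is to minimize the sum of `c[i]*x[i]` subject to the coupling constraint that the sum of `a[i]*x[i]` is at least `b`, with each `x[i]` in {0, 1}. Relaxing the coupling constraint with a multiplier `lam >= 0` leaves one independent subproblem per item.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def solve_subproblem(args: Tuple[float, float, float]) -> Tuple[int, float]:
    """Minimize (cost - lam * weight) * x over x in {0, 1}; return (x, objective value)."""
    cost, weight, lam = args
    reduced = cost - lam * weight
    return (1, reduced) if reduced < 0 else (0, 0.0)


def lagrangean_lower_bound(c: List[float], a: List[float], b: float,
                           iterations: int = 50, workers: int = 4) -> Tuple[float, float]:
    """Synchronous subgradient method; returns (best lower bound, final multiplier)."""
    lam, best = 0.0, float("-inf")
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for t in range(iterations):
            # Distribute the subproblems for the current multiplier to the workers.
            tasks = [(ci, ai, lam) for ci, ai in zip(c, a)]
            results = list(pool.map(solve_subproblem, tasks))
            x = [r[0] for r in results]
            # The Lagrangean dual value is a valid lower bound for any lam >= 0.
            best = max(best, lam * b + sum(r[1] for r in results))
            # Update the multiplier only after all subproblems have finished.
            subgradient = b - sum(ai * xi for ai, xi in zip(a, x))
            lam = max(0.0, lam + subgradient / (t + 1))
    return best, lam
```

With `c = [4, 3, 5]`, `a = [2, 1, 3]`, and `b = 4`, the best integer solution costs 8 and the bound climbs toward 7 as the multiplier settles near 2. Python threads are used here only to show the structure; for CPU-bound subproblems a process pool or a compiled language would be needed to see real speedup.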

A Forward Closed-Form Position Solution, Kinematic Analysis And Implementation of a Translational 3-DOF Parallel Mechanism Formed by Constraining a Stewart Platform Structure (스트워트 플랫폼 구조를 구속하여 얻어지는 병진형 3 자유도 병렬 메커니즘의 정위치 해석해와 기구학 해석 및 구현)

  • Shin Dong-Min;Chung Jae-Heon;Oh Se-Min;Yi Byung-Ju;Kim Whee-Kuk
    • Journal of Institute of Control, Robotics and Systems / v.12 no.10 / pp.1035-1043 / 2006
  • In this study, a translational 3-DOF parallel mechanism formed by constraining the Stewart Platform Mechanism is investigated. The mechanism has three struts (3-UPS type serial subchains) and, in addition, a PPP type serial subchain in the middle of the mechanism. First, closed-form forward and reverse position solutions are derived for this mechanism. Next, the kinematic characteristics are analyzed using the isotropic index of the Jacobian to examine the effects of the design parameters. Lastly, a prototype mechanism is implemented, and the kinematic performance of the translational 3-DOF parallel mechanism is verified experimentally.

Parallel Implementation of Nonlinear Analysis Program of PSC Frame Using MPI (MPI를 이용한 PSC 프레임 비선형해석 프로그램의 병렬화)

  • 이재석;최규천
    • Proceedings of the Computational Structural Engineering Institute Conference / 2001.04a / pp.61-68 / 2001
  • A parallel nonlinear analysis program for prestressed concrete (PSC) frames is migrated to a PC cluster system and a massively parallel processing system, the CRAY T3E, using MPI. The PC cluster system is configured with Pentium Ⅲ class PCs and Fast Ethernet. The CRAY T3E system is composed of a set of nodes, each containing one Processing Element (PE), a memory subsystem, and its distributed memory interconnect network. Parallel computing algorithms are implemented in the element-wise processing parts, including the calculation of the stiffness matrix and element stresses, determination of material states, checking of material failure, and calculation of unbalanced loads. The parallel performance of the migrated program is evaluated through typical numerical examples.

An Implementation of the DEVS Formalism on a Parallel Distributed Environment (병렬 분산 환경에서의 DEVS 형식론의 구현)

  • 성영락
    • Journal of the Korea Society for Simulation / v.1 no.1 / pp.64-76 / 1992
  • The DEVS (discrete event system specification) formalism specifies a discrete event system in a hierarchical, modular form. DEVSIM++ is a C++-based general-purpose DEVS abstract simulator that can simulate systems modeled by the DEVS formalism in a sequential environment. This paper describes P-DEVSIM++, a parallel version of DEVSIM++. In P-DEVSIM++, the external and internal events of DEVS models can be processed in parallel. For such processing, we propose a parallel, distributed optimistic simulation algorithm based on the Time Warp approach. However, the proposed algorithm localizes the rollback of a model within itself, which is not possible in the standard Time Warp approach. An advantage of such localization is that the simulation time may be reduced. To evaluate its performance, we simulate a single-bus multiprocessor architecture with an external common memory. The simulation results show that significant speedup is possible with our algorithm in a parallel environment.

A Multithreaded Implementation of HEVC Intra Prediction Algorithm for a Photovoltaic Monitoring System

  • Choi, Yung-Ho;Ahn, Hyung-Keun
    • Transactions on Electrical and Electronic Materials / v.13 no.5 / pp.256-261 / 2012
  • Recently, many photovoltaic (PV) systems, including solar parks and PV farms, have been built to prepare for the post-fossil-fuel era. To investigate the degradation process of PV systems and thus operate them efficiently, there is a need to visually monitor PV systems in the infrared range through the Internet. For efficient visual monitoring, this paper explores a multithreaded implementation of the recently developed HEVC standard, whose compression efficiency is almost two times higher than that of H.264. For an efficient parallel implementation on a mesh-based 64-core system, this work takes into account various design choices that can solve potential problems of a 64-core system based on two-dimensional interconnects. These problems may not occur in a small-scale multicore system based on a simple bus network. Through extensive evaluation, this paper shows that, for an efficient multithreaded implementation of HEVC intra prediction in a mesh-based multicore system, much effort needs to be made to optimize communication among processing cores. Thus, this work provides three design choices regarding communication: main thread core location, cache home policy, and maximum coding unit size. These design choices are shown to improve the overall parallel performance of the HEVC intra prediction algorithm by up to 42%, achieving a 7 times higher speed-up.

Design and Implementation of RAID Controller using Serial ATA Interface (Serial ATA Interface를 통한 RAID Controller 보드의 설계 및 구현)

  • Lim, Seung-Ho;Lee, Ju-Pyung;Park, Kyu-Ho
    • Proceedings of the KIEE Conference / 2003.11c / pp.665-668 / 2003
  • In this paper, we design and implement a RAID controller board that connects to the host computer through a serial ATA interface and to the disks through parallel ATA interfaces. Serial ATA was proposed to overcome the design limitations of parallel ATA while enabling the storage interface to scale with the growing media rate demands of PC platforms. Serial ATA is intended to replace parallel ATA while remaining compatible with existing operating systems and drivers, adding performance headroom for years to come. Moreover, serial ATA provides a transfer rate of 150 Mbytes/s, which is higher than that of current parallel ATA. The RAID controller board designed in this paper combines up to 4 disks with parallel ATA interfaces and connects to the PC host computer through a serial ATA interface. We implemented the RAID controller in Verilog HDL on an FPGA chip. The RAID controller supports RAID level 0 and 1 functionality. Experimentally, the average read/write performance of the parallel ATA interface is about 30 Mbytes/s. Therefore, when 4 parallel ATA disks are connected to the RAID controller board, we can get almost the full throughput of the serial ATA protocol using the RAID level 0 configuration.
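
The throughput reasoning in this abstract can be checked with a short sketch. It assumes ideal RAID level 0 striping with one block per stripe unit and no controller overhead; the function names are hypothetical and nothing here reflects the paper's Verilog design. Under those assumptions, four disks at about 30 Mbytes/s each deliver about 120 Mbytes/s, which is 80% of the 150 Mbytes/s serial ATA link.

```python
from typing import Tuple


def raid0_locate(logical_block: int, num_disks: int = 4) -> Tuple[int, int]:
    """Map a logical block to (disk index, block offset on that disk) under round-robin striping."""
    return logical_block % num_disks, logical_block // num_disks


def raid0_throughput(per_disk_mb_s: float, num_disks: int, link_mb_s: float) -> float:
    """Ideal aggregate throughput: the striped disks in parallel, capped by the host link."""
    return min(per_disk_mb_s * num_disks, link_mb_s)


# Figures taken from the abstract: 4 disks, about 30 Mbytes/s each, 150 Mbytes/s serial ATA link.
aggregate = raid0_throughput(30, 4, 150)
link_utilization = aggregate / 150
```

Consecutive logical blocks land on different disks (`raid0_locate(4)` and `raid0_locate(5)` map to disks 0 and 1), which is why sequential transfers can draw on all four disks at once.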

A Massively Parallel Algorithm for Fuzzy Vector Quantization (퍼지 벡터 양자화를 위한 대규모 병렬 알고리즘)

  • Huynh, Luong Van;Kim, Cheol-Hong;Kim, Jong-Myon
    • The KIPS Transactions:PartA / v.16A no.6 / pp.411-418 / 2009
  • Vector quantization algorithms based on fuzzy clustering have been widely used in the field of data compression, since the use of fuzzy clustering analysis in the early stages of a vector quantization process can make the process less sensitive to its initialization. However, fuzzy clustering is computationally very intensive because of its complex framework for the quantitative formulation of the uncertainty involved in the training vector space. To overcome this computational burden, this paper introduces an array architecture for the implementation of fuzzy vector quantization (FVQ). The array architecture, which consists of 4,096 processing elements (PEs), provides a computationally efficient solution by employing an effective vector assignment strategy during the clustering process. Experimental results indicate that the proposed parallel implementation provides significantly greater performance and efficiency than appropriately scaled alternative array systems. In addition, the proposed parallel implementation provides 1000x greater performance and 100x higher energy efficiency than other implementations using today's ARM and TI DSP processors in the same 130nm technology. These results demonstrate the potential of the proposed parallel implementation for improved performance and energy efficiency.
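
To show where the computational cost of fuzzy clustering comes from, here is a minimal sketch of the standard fuzzy c-means membership rule for a single training vector. It is not necessarily the FVQ variant used in the paper; the function name `fuzzy_memberships` and the default fuzzifier of 2.0 are assumptions. Every training vector needs a distance to every codeword on each iteration, and each vector's memberships are independent of the others, which is the kind of work that can be spread across many processing elements.

```python
import math
from typing import List, Sequence


def fuzzy_memberships(vector: Sequence[float],
                      codebook: Sequence[Sequence[float]],
                      fuzzifier: float = 2.0) -> List[float]:
    """Standard fuzzy c-means memberships of one vector in each codeword (they sum to 1)."""
    dists = [math.dist(vector, codeword) for codeword in codebook]
    zero = [d == 0.0 for d in dists]
    if any(zero):
        # The vector coincides with one or more codewords: share membership among them.
        share = 1.0 / sum(zero)
        return [share if z else 0.0 for z in zero]
    power = 2.0 / (fuzzifier - 1.0)
    inverse = [d ** -power for d in dists]
    total = sum(inverse)
    return [v / total for v in inverse]
```

For a vector at the origin and codewords at distances 1 and 2, the memberships with fuzzifier 2.0 are 0.8 and 0.2: the closer codeword dominates, but the farther one still receives a share.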