• Title/Summary/Keyword: 병렬성능

Search Result 1,943, Processing Time 0.028 seconds

Performance Analysis on Parallel Processing of a Hybrid of a CPU and a GPU (CPU와 GPU의 혼합 병렬 계산에 대한 성능 분석)

  • Hwang, Keunchang;Kim, Youngtae
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2016.04a
    • /
    • pp.59-60
    • /
    • 2016
  • 본 논문에서는 고성능 병렬 계산 장치로 주목받고 있는 GPU를 CPU와 동시에 병렬로 사용한 계산 성능을 분석하였다. 성능 분석을 위하여 원주율(${\pi}$)을 적분으로 계산하는 CUDA 프로그램을 사용하였으며, 전체 계산을 GPU 대비 CPU 계산 부분으로 할당하여 성능을 분석하였다.

Design of Parallel CBF(Cel1-Based Filtering) Scheme using Horizontal1y-Partitioned Method (수평 분할 방법을 이용한 병렬 CBF(Cell-Based Filtering) 기법의 설계)

  • 김남기;장재우
    • Proceedings of the Korean Information Science Society Conference
    • /
    • 2001.10a
    • /
    • pp.70-72
    • /
    • 2001
  • 기존의 CBF 기법은 데이타의 차원이 증가함에 따라 검색 성능이 급격히 저하되는 ‘Dimensional Curse’문제를 해결하기 위해 제안되었다. 그러나, 데이타의 양이 증가하고 차원이 증가할수록 검색 성능이 선형적인 감소를 보인다. 따라서, 본 논문에서는 CBF 기법의 성능 향상을 위해 멀티 디스크 환경을 기반으로 하는 병렬 CBF 기법을 제안한다. 제안하는 병렬 CBF 기법은 멀티 디스크 환경하에서 CBF가 지니는 특성을 이용하여 시그니쳐와 특징 벡터 데이타의 수평 분할 방법을 사용한다. 이를 통해, 제안하는 기법은 디스크 개수에 비례하여 선형적인 검색성능 향상을 가져온다.

  • PDF

The Parallel ANN(Artificial Neural Network) Simulator using Mobile Agent (이동 에이전트를 이용한 병렬 인공신경망 시뮬레이터)

  • Cho, Yong-Man;Kang, Tae-Won
    • The KIPS Transactions:PartB
    • /
    • v.13B no.6 s.109
    • /
    • pp.615-624
    • /
    • 2006
  • The objective of this paper is to implement parallel multi-layer ANN(Artificial Neural Network) simulator based on the mobile agent system which is executed in parallel in the virtual parallel distributed computing environment. The Multi-Layer Neural Network is classified by training session, training data layer, node, md weight in the parallelization-level. In this study, We have developed and evaluated the simulator with which it is feasible to parallel the ANN in the training session and training data parallelization because these have relatively few network traffic. In this results, we have verified that the performance of parallelization is high about 3.3 times in the training session and training data. The great significance of this paper is that the performance of ANN's execution on virtual parallel computer is similar to that of ANN's execution on existing super-computer. Therefore, we think that the virtual parallel computer can be considerably helpful in developing the neural network because it decreases the training time which needs extra-time.

Performance Comparison of Join Operations Parallelization by using GPGPU (GPGPU 기반 조인 연산 병렬화 성능 비교)

  • Lee, Jong-Sub;Lee, Sang-Back;Lee, Kyu-Chul
    • Database Research
    • /
    • v.34 no.3
    • /
    • pp.28-44
    • /
    • 2018
  • In a database system, the most expensive operation among relational operations is a join operation. Generally, CPU-based join operations uses parallel processing with either 1 core or 16 cores at most, which does not significantly improve the function. On the other hand, GPGPU(General-Purpose computing on Graphics Processing Units) allows parallel processing through thousands of processing units, greatly reducing the time required to perform join operations. Parallelization of the operation using GPGPU uses NVIDIA's CUDA SDK. In this paper, we implement parallelization of the join operation using GPGPU and compare the performances. The used join operations are Nested Loop Join (NLJ), Sort Merge Join (SMJ) and Hash Join (HJ), and GPGPU equipment uses TITAN Xp, GTX 1080 Ti and GTX 1080. We measure and compare the performance of join operations based on CPU and GPGPU. We compare this performance with the performance of the previous study on the join operation based on GPGPU. The results of experiment show that the performance based on GPGPU is 6~328 times faster than the one based on CPU.

Multistage Parallel Nulling-Partial PIC Receiver for Downlink MIMO MC-CDMA Systems (하향링크 다중 안테나 MC-CDMA 시스템을 위한 다단계 병렬 널링 및 병렬 부분 간섭 제거 수신기 설계)

  • 구정회;김경연;심세준;이충용
    • Journal of the Institute of Electronics Engineers of Korea TC
    • /
    • v.41 no.11
    • /
    • pp.1-7
    • /
    • 2004
  • We propose multistage parallel nulling (MPN) partial parallel interference cancellation (PPIC) receiver for downlink multiple-input multiple-output (MIMO) multicarrier (MC)-code division multiple access (CDMA) systems. Though the V-BLAST is a popular MIMO receiver, it shows error floor for multiuser downlink MIMO MC-CDMA systems. The proposed MPN-PPIC receiver does not produce error floor for multiuser case, and achieves substantial performance gains with multistage processing. For single user case, the proposed method also surpasses the V-BLAST receiver with multistage processing for MIMO MC-CDMA systems with chip level interleaving. The system performance of the proposed MPN-PPIC receiver is evaluated through computer simulations.

Join Operation of Parallel Database System with Large Main Memory (대용량 메모리를 가진 병렬 데이터베이스 시스템의 조인 연산)

  • Park, Young-Kyu
    • Journal of the Korea Society of Computer and Information
    • /
    • v.12 no.3
    • /
    • pp.51-58
    • /
    • 2007
  • The shared-nothing multiprocessor architecture has advantages in scalability, this architecture has been adopted in many multiprocessor database system. But, if the data are not uniformly distributed across the processors, load will be unbalanced. Therefore, the whole system performance will deteriorate. This is the data skew problem, which usually occurs in processing parallel hash join. Balancing the load before performing join will resolve this problem efficiently and the whole system performance can be improved. In this paper, we will present an algorithm using merit of very large memory to reduce disk access overhead in performing load balancing and to efficiently solve the data skew problem. Also, we will present analytical model of our new algorithm and present the result of some performance study we made comparing our algorithm with the other algorithms in handling data skew.

  • PDF

Parallelization of Multifrontal Solution Method for Shared Memory Architecture (다중프론트 해법의 공유메모리 병렬화)

  • Kim, Min Ki;Kim, Jeong Ho;Park, Chan Yik;Kim, Seung Jo
    • Journal of the Korean Society for Aeronautical & Space Sciences
    • /
    • v.40 no.11
    • /
    • pp.972-978
    • /
    • 2012
  • This paper discusses the parallelization of multifrontal solution method, widely used for finite element structural analyses, for a shared memory architecture. Multifrontal method is easier than other linear solution methods because the solution procedure implies that unknowns can be eliminated simultaneously. Two innovative ideas are introduced to achieve optimal solver performance on a shared memory computer. Those are pairing two frontal matrices and splitting the frontal matrix in order to reduce the temporal memory space required by independent computing tasks. Performance comparisons between original algorithm and proposed one prove that proposed method is more computationally efficient on current multicore machines.

Speedup Analysis Model for High Speed Network based Distributed Parallel Systems (고속 네트웍 기반의 분산병렬시스템에서의 성능 향상 분석 모델)

  • 김화성
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.26 no.12C
    • /
    • pp.218-224
    • /
    • 2001
  • The objective of Distributed Parallel Computing is to solve the computationally intensive problems, which have several types of parallelism, on a suite of high performance and parallel machines in a manner that best utilizes the capabilities of each machine. In this paper, we propose a computational model including the generalized graph representation method of distributed parallel systems for speedup analysis, and analyze how the super-linear speedup is achieved when scheduling of programs with diverse embedded parallelism modes onto a distributed heterogeneous supercomputing network environment. The proposed representation method can also be applied to simple homogeneous or heterogeneous systems whose components are heterogeneous only in terms of the processor speed. In order to obtain the core speedup, the matching of the parallelism characteristics between tasks and parallel machines should be carefully handled while minimizing the communication overhead.

  • PDF

Implementation and analysis of a parallel suffix tree construction algorithm using TBB and Cilk Plus (TBB, Cilk Plus를 이용한 병렬 접미사 트리 생성 알고리즘 구현 및 성능 분석)

  • Seo, Jun-Ho;Na, Joong-Chae
    • Proceedings of the Korean Information Science Society Conference
    • /
    • 2012.06a
    • /
    • pp.403-405
    • /
    • 2012
  • 접미사 트리는 문자열 압축, 텍스트 처리, 생물정보학 등 다양한 응용 분야에서 사용되는 인덱스 자료구조이다. 최근 64bit 하드웨어와 멀티코어 CPU가 보급됨에 따라 메모리상에서 병렬로 접미사 트리를 생성하는 알고리즘이 활발히 연구되고 있다. 본 논문에서는 McCreight의 선형시간 알고리즘과 Chen의 병렬 알고리즘을 기반으로 메모리상에서 접미사 트리를 병렬로 생성하는 구현 방법을 보였으며, TBB, Cilk Plus와 같은 병렬 프로그래밍 라이브러리를 이용하여 병렬 알고리즘을 구현하였다. 알고리즘 실험 결과 병렬로 수행한 알고리즘이 직렬로 수행한 결과보다 최대 4배 가량 성능 향상을 얻을 수 있었으며, 병렬 라이브러리를 사용함으로써 가지는 오버헤드는 극히 적은 것으로 나타났다.

Design of Parallel Processing of Lane Detection System Based on Multi-core Processor (멀티코어를 이용한 차선 검출 병렬화 시스템 설계)

  • Lee, Hyo-Chan;Moon, Dai-Tchul;Park, In-hag;Heo, Kang
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.20 no.9
    • /
    • pp.1778-1784
    • /
    • 2016
  • we improved the performance by parallelizing lane detection algorithms. Lane detection, as a intellectual assisting system, helps drivers make an alarm sound or revise the handle in response of lane departure. Four kinds of algorithms are implemented in order as following, Gaussian filtering algorithm so as to remove the interferences, gray conversion algorithm to simplify images, sobel edge detection algorithm to find out the regions of lanes, and hough transform algorithm to detect straight lines. Among parallelized methods, the data level parallelism algorithm is easy to design, yet still problem with the bottleneck. The high-speed data level parallelism is suggested to reduce this bottleneck, which resulted in noticeable performance improvement. In the result of applying actual road video of black-box on our parallel algorithm, the measurement, in the case of single-core, is approximately 30 Frames/sec. Furthermore, in the case of octa-core parallelism, the data level performance is approximately 100 Frames/sec and the highest performance comes close to 150 Frames/sec.