• Title/Abstract/Keywords: Massively parallel execution


Design and Implementation of a Massively Parallel Multithreaded Architecture: DAVRID

  • Sangho Ha;Kim, Junghwan;Park, Eunha;Yoonhee Hah;Sangyong Han;Daejoon Hwang;Kim, Heunghwan;Seungho Cho
    • Journal of Electrical Engineering and Information Science / Vol. 1, No. 2 / pp.15-26 / 1996
  • Massively parallel architectures (MPAs) should address two fundamental issues for scalability: synchronization and communication latency. Dataflow architectures offer the ability to exploit the massive parallelism inherent in programs, but they suffer from excessive synchronization overhead and inefficient execution of sequential programs. In contrast, MPAs based on the von Neumann computational model may suffer from inefficient synchronization mechanisms and communication latency. DAVRID (DAtaflow/Von Neumann RISC hybrID) is a massively parallel multithreaded architecture that takes advantage of both the von Neumann and dataflow models. It has good single-thread performance and tolerates synchronization and communication latency. In this paper, we describe the DAVRID architecture in detail and evaluate its performance through simulation runs over several benchmarks.


Development of a drift-flux model based core thermal-hydraulics code for efficient high-fidelity multiphysics calculation

  • Lee, Jaejin;Facchini, Alberto;Joo, Han Gyu
    • Nuclear Engineering and Technology / Vol. 51, No. 6 / pp.1487-1503 / 2019
  • The methods and performance of ESCOT, a pin-level nuclear reactor core thermal-hydraulics (T/H) code employing the drift-flux model, are presented. The code aims to provide an accurate yet fast core thermal-hydraulics solution capability for high-fidelity multiphysics core analysis systems targeting massively parallel computing platforms. The four-equation drift-flux model is adopted for two-phase calculations, and numerical solutions are obtained by applying the Finite Volume Method (FVM) and a Semi-Implicit Method for Pressure-Linked Equations (SIMPLE)-like algorithm on a staggered grid. Constitutive models for turbulent mixing, pressure drop, and vapor generation are employed to simulate key phenomena in subchannel-scale analyses. ESCOT is parallelized by a domain decomposition scheme that involves both radial and axial decomposition to enable highly parallelized execution. The ESCOT solutions are validated against various experiments, including CNEN $4{\times}4$, the two-assembly test of Weiss et al., PNNL $2{\times}6$, RPI $2{\times}2$ air-water, and PSBT, covering single-phase/two-phase and unheated/heated conditions. The parameters of interest for validation include flow characteristics such as turbulent mixing, spacer grid pressure drop, cross-flow, reverse flow, buoyancy effect, void drift, and bubble generation. For all the validation tests, ESCOT shows good agreement with measured data, to an extent comparable to that of other subchannel-scale codes: COBRA-TF, MATRA and/or CUPID. The execution performance is examined with a mini-sized whole core consisting of 89 fuel assemblies and with an OPR1000 core. ESCOT turns out to be about 1.5 times faster than a subchannel code based on the two-fluid, three-field model. The axial domain decomposition scheme works as well as the radial one, yielding a steady-state solution for the OPR1000 core within 30 s with 104 processors.

Modeling, Discovering, and Visualizing Workflow Performer-Role Affiliation Networking Knowledge

  • Kim, Haksung;Ahn, Hyun;Kim, Kwanghoon Pio
    • KSII Transactions on Internet and Information Systems (TIIS) / Vol. 8, No. 2 / pp.691-708 / 2014
  • This paper formalizes a special type of social networking knowledge, which is called "workflow performer-role affiliation networking knowledge." A workflow model specifies execution sequences of the associated activities and their affiliated relationships with roles, performers, invoked applications, and relevant data. In particular, these affiliated relationships exhibit a stream of organizational work-sharing knowledge and utilize business process intelligence to explore the resource allotment and planning knowledge concealed in the corresponding workflow model. In this paper, we focus on the performer-role affiliation relationships and their implications as organizational and business process intelligence in workflow-driven organizations. We elaborate a series of theoretical formalisms and practical implementations for modeling, discovering, and visualizing workflow performer-role affiliation networking knowledge, together with practical details of the knowledge representation, discovery, and visualization techniques. These theoretical concepts and practical algorithms are based upon the information control net methodology for formally describing workflow models, and the affiliated knowledge represents the degrees of involvement and participation between a group of performers and a group of roles in a corresponding workflow model. Finally, we summarize the implications of the proposed affiliation networking knowledge as business process intelligence, and how worthwhile it is to discover and visualize this knowledge in workflow-driven organizations and enterprises that produce massively parallel interactions and large-scale operational data collections by deploying and enacting massively parallel and large-scale workflow models.

Algorithmic GPGPU Memory Optimization

  • Jang, Byunghyun;Choi, Minsu;Kim, Kyung Ki
    • JSTS: Journal of Semiconductor Technology and Science / Vol. 14, No. 4 / pp.391-406 / 2014
  • The performance of General-Purpose computation on Graphics Processing Units (GPGPU) is heavily dependent on memory access behavior. This sensitivity is due to a combination of the underlying Massively Parallel Processing (MPP) execution model present on GPUs and the lack of architectural support for handling irregular memory access patterns. Application performance can be significantly improved by applying memory-access-pattern-aware optimizations that exploit knowledge of the characteristics of each access pattern. In this paper, we present an algorithmic methodology to semi-automatically find the best mapping of the memory accesses present in serial loop nests to the underlying data-parallel architectures, based on a comprehensive static memory access pattern analysis. To that end, we present a simple yet powerful mathematical model that captures all memory access pattern information present in serial data-parallel loop nests. We then show how this model is used in practice to select the most appropriate memory space for data and to search for an appropriate thread mapping and work group size from a large design space. To evaluate the effectiveness of our methodology, we report execution speedup using selected benchmark kernels that cover a wide range of memory access patterns commonly found in GPGPU workloads. Our experimental results are reported using the industry-standard heterogeneous programming language, OpenCL, targeting the NVIDIA GT200 architecture.

대기질 예보의 성능 향상을 위한 커널 삼중대각 희소행렬을 이용한 고속 자료동화 (Fast Data Assimilation using Kernel Tridiagonal Sparse Matrix for Performance Improvement of Air Quality Forecasting)

  • 배효식;유숙현;권희용
    • 한국멀티미디어학회논문지 / Vol. 20, No. 2 / pp.363-370 / 2017
  • Data assimilation is an initialization method for air quality forecasts such as PM10 forecasts, and it is very important for improving forecast accuracy. Optimal interpolation is one of the data assimilation techniques; it is very effective and widely used in air quality forecasting. The technique, however, requires a large amount of memory and a long execution time, which makes real-time PM10 air quality forecasting difficult. We propose a fast optimal interpolation data assimilation method for PM10 air quality forecasting using a new kernel tridiagonal sparse matrix and the CUDA massively parallel processing architecture. Experimental results show that the proposed method is 5 to 56 times faster than conventional ones.
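
The abstract does not say how the kernel tridiagonal sparse matrix is constructed, so the following is only a generic Python sketch of why tridiagonal storage can save memory and time. Keeping just the three diagonals needs O(n) storage instead of O(n^2), and each output row of a matrix-vector product reads at most three inputs, so rows can be computed independently (for example, one row per GPU thread). The function name `tridiag_matvec` and the sample values are hypothetical and are not taken from the paper.

```python
def tridiag_matvec(lower, diag, upper, x):
    """Multiply a tridiagonal matrix by x using only its three diagonals.

    lower has n-1 entries (below the main diagonal), diag has n entries,
    and upper has n-1 entries (above the main diagonal).
    """
    n = len(diag)
    y = [0.0] * n
    for i in range(n):  # rows are independent, so this loop parallelizes
        acc = diag[i] * x[i]
        if i > 0:
            acc += lower[i - 1] * x[i - 1]
        if i < n - 1:
            acc += upper[i] * x[i + 1]
        y[i] = acc
    return y


# Hypothetical 4x4 example: 10 stored numbers instead of 16.
lower = [1.0, 1.0, 1.0]
diag = [4.0, 4.0, 4.0, 4.0]
upper = [2.0, 2.0, 2.0]
result = tridiag_matvec(lower, diag, upper, [1.0, 2.0, 3.0, 4.0])
```

For a grid with thousands of points, the gap between storing 3n - 2 values and n^2 values is what makes this kind of structure attractive on memory-limited GPUs.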

다중 프로세서 칩을 위한 시스템 제어 장치의 구조설계 및 FPGA 구현 (Architecture design and FPGA implementation of a system control unit for a multiprocessor chip)

  • 박성모;정갑천
    • 전자공학회논문지C / Vol. 34C, No. 12 / pp.9-19 / 1997
  • This paper describes the design and FPGA implementation of a system control unit within a multiprocessor chip which can be used as a node processor in a massively parallel processing (MPP) [...] caches, memory management units, a bus unit and a system control unit. The major functions of the system control unit are locking and unlocking of shared variables for protected access, synchronization of instruction execution among four integer units, control of interrupts, generation and control of the processor's status, etc. The system control unit was modeled at a very high level using Verilog HDL. It was then simulated and verified in an environment to which a trap handler and an external interrupt controller were added. The functional blocks of the system control unit were converted into an RTL (register transfer level) model and synthesized using the Xilinx FPGA cell library in the Synopsys tool. After timing verification, the synthesized system control unit was implemented on a Xilinx FPGA chip (XC4025EPG299).


CUDA를 이용한 FDTD 알고리즘의 병렬처리 (Parallel Computation of FDTD algorithm using CUDA)

  • 이호영;박종현;김준성
    • 전자공학회논문지CI / Vol. 47, No. 4 / pp.82-87 / 2010
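  • As GPU computing power has improved to surpass that of CPUs, research on GP-GPU, which uses graphics processors for general-purpose computation, is being actively pursued and its application areas are expanding. In this paper, we implement the FDTD algorithm, which is widely used in electromagnetics-related fields, using CUDA, the software platform provided by NVIDIA. We parallelize the main computation steps of the FDTD algorithm, optimize them according to the use of the different memories within the graphics card, and measure the performance improvement relative to running the FDTD algorithm on a single processor. The experiments confirmed that execution time improved by up to 45 times compared with the single-processor implementation.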
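
To illustrate why the FDTD update parallelizes well, here is a minimal serial 1D sketch in Python. It is not the paper's CUDA implementation: it assumes normalized units, a Courant number of 0.5, and a Gaussian source whose shape and position are chosen arbitrarily. Within each half-step, every cell reads only its nearest neighbors from the other field, so the inner loops could in principle be mapped to one GPU thread per cell.

```python
import math


def fdtd_1d(n_cells=200, n_steps=60, source_pos=100):
    """1D FDTD in normalized units with a Courant number of 0.5."""
    ez = [0.0] * n_cells  # electric field at integer grid points
    hy = [0.0] * n_cells  # magnetic field at half-integer grid points
    for t in range(n_steps):
        # E update: each cell reads only the two neighboring H values.
        for k in range(1, n_cells):
            ez[k] += 0.5 * (hy[k - 1] - hy[k])
        # Soft Gaussian source (hypothetical pulse shape) at one cell.
        ez[source_pos] += math.exp(-0.5 * ((t - 30) / 10.0) ** 2)
        # H update: each cell reads only the two neighboring E values.
        for k in range(n_cells - 1):
            hy[k] += 0.5 * (ez[k] - ez[k + 1])
    return ez, hy


ez, hy = fdtd_1d()
```

With the default arguments the pulse spreads symmetrically from the source cell and does not reach the grid boundaries, so no absorbing boundary condition is needed for this short run.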

멀티그리드 방법을 이용한 프로펠러 주위의 비압축성 층류유동 계산 (Numerical Simulation of Incompressible Laminar Flow around a Propeller Using the Multigrid Technique)

  • 박원규
    • 대한조선학회논문집 / Vol. 31, No. 4 / pp.41-50 / 1994
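  • To analyze the incompressible viscous flow around a propeller, an iterative time marching method using the multigrid technique was applied. The method solves the three-dimensional incompressible Navier-Stokes equations in a moving, non-orthogonal generalized coordinate system, with first-order accuracy in time and second- or third-order accuracy in space, and it uses the multigrid method to accelerate the convergence of the iterations. The method also has the advantage of being very easy to apply to vector or parallel computers. The present results generally agree well with experimental data and with other researchers' computational results, and the multigrid method was shown to reduce the CPU time required for convergence and to improve the accuracy of the solution.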


벡타 연산을 효율적으로 수행하기 위한 다중 스레드 구조 (A Multithreaded Architecture for the Efficient Execution of Vector Computations)

  • 윤성대;정기동
    • 한국정보처리학회논문지 / Vol. 2, No. 6 / pp.974-984 / 1995
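  • In this paper, we present MULVEC (MULtithreaded architecture of the VEctor Computations), a multithreaded architecture that executes vector operations efficiently and supports massively parallel systems. MULVEC introduces the conventional von Neumann model, with a superscalar RISC microprocessor, into the dataflow model. When vector operations are repeated within the same thread segment, a status field is used to reduce the number of synchronizations, which in turn reduces the number of context switches and the amount of communication. We simulated on a SPARCstation 20 (superscalar RISC microprocessor) the performance of MULVEC (program execution time and processor utilization) and of *T (program execution time) as the number of nodes varies. Depending on the number of nodes and the number of loop iterations, programs ran about 1 to 2 times faster on MULVEC than on *T.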


CUDA를 이용한 고속 영상 회전 알고리즘에 관한 연구 (A Study on High Speed Image Rotation Algorithm using CUDA)

  • 권희철;조형진;권희용
    • 한국인터넷방송통신학회논문지 / Vol. 16, No. 5 / pp.1-6 / 2016
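  • Image rotation is one of the important preprocessing methods in image processing and image pattern recognition. Image rotation is performed by multiplication with a rotation matrix. However, conventional methods require a large number of floating-point operations and trigonometric function calculations, so they take a long time to run. In this paper, we propose a new high-speed image rotation algorithm that removes these two major delay-causing operations. The proposed algorithm is very fast because it performs only two shear operations. We also apply CUDA, a recent parallel processing technology. CUDA is a massively parallel computing architecture that uses GPUs, which have recently become widespread. Because a GPGPU is a graphics-dedicated processor, it shows excellent performance in pixel-level parallel processing. The proposed algorithm is compared experimentally with a conventional rotation algorithm on images of various sizes. The experimental results show that the proposed algorithm outperforms the conventional method by a factor of 8 or more.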
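
The abstract does not describe the proposed two-shear algorithm in detail. As background only, the sketch below checks the classical three-shear decomposition of a rotation matrix (x-shear, y-shear, x-shear), which is a different and well-known technique. It shows how a rotation can be expressed with shears, each of which only shifts pixels along one axis. The helper names `matmul2` and `rotation_from_shears` and the 30 degree angle are hypothetical.

```python
import math


def matmul2(a, b):
    """Multiply two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def rotation_from_shears(theta):
    """Compose a rotation by theta from three shears (x, y, x)."""
    x_shear = [[1.0, -math.tan(theta / 2.0)], [0.0, 1.0]]
    y_shear = [[1.0, 0.0], [math.sin(theta), 1.0]]
    return matmul2(matmul2(x_shear, y_shear), x_shear)


theta = math.radians(30.0)
sheared = rotation_from_shears(theta)
# Conventional rotation matrix for comparison.
direct = [[math.cos(theta), -math.sin(theta)],
          [math.sin(theta), math.cos(theta)]]
```

In this decomposition the trigonometric values are computed once per image rather than once per pixel, and each shear pass moves a whole row or column by a common offset, which suits pixel-level parallel hardware.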