• Title/Summary/Keyword: Parallel Memory

PARALLEL IMPROVEMENT IN STRUCTURED CHIMERA GRID ASSEMBLY FOR PC CLUSTER (PC 클러스터를 위한 정렬 중첩 격자의 병렬처리)

  • Kim, Eu-Gene;Kwon, Jang-Hyuk
    • Korean Society of Computational Fluids Engineering: Conference Proceedings / 2005.10a / pp.157-162 / 2005
  • Parallel implementation and performance assessment of the grid assembly in a structured chimera grid approach are studied. The grid assembly process, involving hole cutting and donor search, is parallelized on a PC cluster. A message-passing programming model based on the MPI library is implemented using the single program multiple data (SPMD) paradigm. The coarse-grained communication is optimized with minimized memory allocation, because in a distributed-memory system such as a PC cluster the parallel grid assembly can access the decomposed geometry data held by other processors only through message passing. The grid assembly workload follows static load balancing tied to the flow solver. The goal of this work is the development of a parallelized grid assembly suited to handling multiple moving-body problems with large grid sizes. A minimal sketch of the message-passing pattern is given below.

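The entry above describes a coarse-grained, SPMD message-passing organization of the grid-assembly step (hole cutting and donor search) on a PC cluster. A minimal sketch of that communication pattern is given below; it assumes mpi4py, and the bounding-box exchange, data shapes, and function names are illustrative assumptions rather than the authors' implementation.

```python
# Minimal SPMD sketch of a distributed donor search, assuming mpi4py.
# Each rank owns one decomposed grid block; geometry owned by other ranks
# is reached only through message passing (bounding boxes via allgather,
# donor requests/replies via alltoall).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Hypothetical local block: a cloud of cell-center coordinates owned by this rank.
rng = np.random.default_rng(rank)
local_cells = rng.random((1000, 3)) + rank          # offset so blocks differ

# 1) Exchange coarse-grained metadata only: one bounding box per rank.
local_bbox = (local_cells.min(axis=0), local_cells.max(axis=0))
all_bboxes = comm.allgather(local_bbox)

# 2) For each fringe point needing a donor, find candidate ranks by bbox overlap.
fringe_points = rng.random((10, 3)) + (rank + 0.5)

def candidate_ranks(p):
    return [r for r, (lo, hi) in enumerate(all_bboxes)
            if r != rank and np.all(p >= lo) and np.all(p <= hi)]

requests = {r: [] for r in range(size)}
for p in fringe_points:
    for r in candidate_ranks(p):
        requests[r].append(p)

# 3) Exchange requests and answer them with a nearest-cell "donor" index.
#    One message per rank pair keeps the communication coarse-grained.
incoming = comm.alltoall([np.array(requests[r]) for r in range(size)])
replies = []
for pts in incoming:
    if len(pts) == 0:
        replies.append(np.array([], dtype=np.int64))
        continue
    # Brute-force nearest cell stands in for a real donor-cell search.
    d = np.linalg.norm(local_cells[None, :, :] - pts[:, None, :], axis=2)
    replies.append(d.argmin(axis=1))
donor_ids = comm.alltoall(replies)

if rank == 0:
    print("donor indices received from each rank:", [len(d) for d in donor_ids])
```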

Analysis on the GPU Performance according to Hierarchical Memory Organization (계층적 메모리 구성에 따른 GPU 성능 분석)

  • Choi, Hongjun;Kim, Jongmyon;Kim, Cheolhong
    • The Journal of the Korea Contents Association / v.14 no.3 / pp.22-32 / 2014
  • Recently, GPUs have been widely used for general-purpose processing (GPGPU) as well as graphics processing, since they provide hardware optimized for parallel processing. The memory system has a large effect on the performance of parallel processing units such as the GPU. In the GPU, a hierarchical memory architecture is implemented to deliver high memory bandwidth, and both memory address coalescing and memory request merging are widely used. This paper analyzes GPU performance under various memory organizations. According to our simulation results, GPU performance improves by 15.5%, 21.5%, 25.5%, and 30.9% when an 8 KB, 16 KB, 32 KB, and 64 KB L1 cache is added, respectively, compared with the case without an L1 cache. However, the experimental results show that some benchmarks lose performance because the number of memory transactions increases due to data dependency. Moreover, the average memory access latency increases with the depth of the cache hierarchy when cache misses occur frequently.
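
The last observation above, that a deeper cache hierarchy can raise the average memory access latency when misses dominate, follows directly from the standard average-memory-access-time (AMAT) recurrence. A back-of-the-envelope sketch is shown below; all latencies and hit rates are made-up illustrative values, not figures from the paper.

```python
# Average memory access time (AMAT) for a hierarchical cache:
#   AMAT = t_L1 + miss_L1 * (t_L2 + miss_L2 * (... + t_DRAM))
# All latencies (cycles) and hit rates below are illustrative assumptions only.

def amat(levels, dram_latency):
    """levels: list of (hit_time, hit_rate) pairs ordered from L1 outward."""
    penalty = dram_latency
    for hit_time, hit_rate in reversed(levels):
        penalty = hit_time + (1.0 - hit_rate) * penalty
    return penalty

dram = 400.0
l1_only   = amat([(4.0, 0.50)], dram)                  # L1 alone
l1_l2     = amat([(4.0, 0.50), (30.0, 0.60)], dram)    # extra level with a useful hit rate
l1_l2_bad = amat([(4.0, 0.50), (30.0, 0.02)], dram)    # extra level that almost always misses

print(f"L1 only         : {l1_only:6.1f} cycles")
print(f"L1 + helpful L2 : {l1_l2:6.1f} cycles")        # deeper hierarchy pays off
print(f"L1 + missing L2 : {l1_l2_bad:6.1f} cycles")    # deeper hierarchy adds latency
```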

Quad-functional Built-in Test Circuit for DRAM-frame-memory Embedded SOG-LCD

  • Takatori, Kenichi;Haga, Hiroshi;Nonaka, Yoshihiro;Asada, Hideki
    • Korean Information Display Society: Conference Proceedings / 2008.10a / pp.914-917 / 2008
  • A quad-functional built-in test circuit has been developed for DRAM-frame-memory embedded SOG-LCDs. The quad function consists of a memory test, a display test, a serial transfer test, and a parallel transfer test, the last of which is the normal operation mode of our SOG-LCD. Results of the memory and display tests are shown.

Parallel Approximate String Matching with k-Mismatches for Multiple Fixed-Length Patterns in DNA Sequences on Graphics Processing Units (GPU을 이용한 다중 고정 길이 패턴을 갖는 DNA 시퀀스에 대한 k-Mismatches에 의한 근사적 병열 스트링 매칭)

  • Ho, ThienLuan;Kim, HyunJin;Oh, SeungRohk
    • The Transactions of The Korean Institute of Electrical Engineers / v.66 no.6 / pp.955-961 / 2017
  • In this paper, we propose a parallel approximate string matching algorithm with k-mismatches for multiple fixed-length patterns (PMASM) in DNA sequences. PMASM is developed from parallel single-pattern approximate string matching algorithms to efficiently calculate the Hamming distances for multiple patterns of a fixed length. In the preprocessing phase of PMASM, all target patterns are binary encoded and stored in a look-up memory. With each character read from the input string, the Hamming distances between a substring and all patterns can be updated at the same time based on the binary encoding information in the look-up memory. Moreover, PMASM adopts graphics processing units (GPUs) to process the data computations in parallel. This paper presents three PMASM implementation methods on GPUs: the thread PMASM, block-thread PMASM, and shared-mem PMASM methods. The shared-mem PMASM method shows how to make effective use of the GPU's parallel capacity, and it exploits special features of the CUDA (Compute Unified Device Architecture) memory structure to optimize performance. In experiments with DNA sequences, the proposed PMASM on the GPU is 385, 77, and 64 times faster than the traditional naive algorithm, the shift-add algorithm, and the single-thread PMASM implementation on the CPU, respectively. With the same NVIDIA GPU model, the performance of the proposed approach is improved by up to 44% and 21% compared with the naive and shift-add algorithms, respectively.
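
The preprocessing and per-character update described above can be rendered serially: each fixed-length pattern is encoded into a per-character bitmask (the "look-up memory"), and one popcount per text window then yields its Hamming distance. The sketch below is an illustrative serial rendering, not the paper's CUDA kernels; the bit layout, names, and the k-mismatch check are assumptions.

```python
# Serial sketch of multi-pattern matching with at most k mismatches.
# Each fixed-length pattern is pre-encoded into per-character bitmasks
# ("look-up memory"); one popcount then gives the Hamming distance of a window.
ALPHABET = "ACGT"

def preprocess(patterns):
    """For every pattern, map each character to a bitmask of the positions
    where that character appears in the pattern."""
    tables = []
    for p in patterns:
        table = {c: 0 for c in ALPHABET}
        for j, c in enumerate(p):
            table[c] |= 1 << j
        tables.append(table)
    return tables

def search(text, patterns, k):
    m = len(patterns[0])                      # all patterns share one fixed length
    tables = preprocess(patterns)
    hits = []
    for i in range(len(text) - m + 1):        # each window would map to a GPU thread
        window = text[i:i + m]
        for pid, table in enumerate(tables):
            matched = 0
            for j, c in enumerate(window):
                matched |= table.get(c, 0) & (1 << j)
            distance = m - bin(matched).count("1")
            if distance <= k:
                hits.append((i, pid, distance))
    return hits

patterns = ["ACGTACGT", "ACGTTTTT", "GGGGACGT"]
text = "TTACGTACGTCCACGTTTTAGGGGACGTA"
print(search(text, patterns, k=1))
```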

Parallel Integration for Real-Time Simulation (실시간 시뮬레이션을 위한 병렬적분)

  • Lee, W.S.;Samson, J.
    • Transactions of the Korean Society of Automotive Engineers / v.2 no.1 / pp.106-115 / 1994
  • A parallel integration approach is proposed for the real-time simulation of controlled mechanical systems. The proposed approach, which employs the dual-rate integration method in a parallel computing environment, is developed to deal effectively with the stiffness and high-frequency characteristics of controlled mechanical systems. Numerical experiments are performed to demonstrate the effectiveness of the approach on shared-memory multiprocessors, the Alliant FX/8 and Alliant FX/80. A toy sketch of the dual-rate idea is given below.

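The dual-rate integration idea referenced above advances the stiff, high-frequency part of the model with several small substeps inside each macro step of the slow part. The toy sketch below illustrates that framing; the test system, step sizes, and the explicit Euler scheme are assumptions for illustration (the paper targets shared-memory multiprocessors).

```python
# Toy dual-rate integration: the "fast" (stiff, high-frequency) state is advanced
# with n_sub small substeps inside every macro step of the "slow" state.

def dual_rate_step(x_slow, x_fast, f_slow, f_fast, H, n_sub):
    """One macro step of size H; the fast subsystem takes n_sub substeps of H/n_sub.
    Explicit Euler is used in both rates purely for brevity."""
    h = H / n_sub
    x_fast_new = x_fast
    for _ in range(n_sub):                               # fast loop (could run on its own processor)
        x_fast_new = x_fast_new + h * f_fast(x_slow, x_fast_new)
    x_slow_new = x_slow + H * f_slow(x_slow, x_fast)     # slow loop uses the old fast state
    return x_slow_new, x_fast_new

# Hypothetical coupled system: a slow mechanical state driven by a fast controller state.
f_slow = lambda xs, xf: -0.5 * xs + 0.1 * xf
f_fast = lambda xs, xf: -50.0 * xf + 10.0 * xs           # stiff: needs a much smaller step

x_slow, x_fast = 1.0, 0.0
for step in range(100):                                  # 100 macro steps of 10 ms
    x_slow, x_fast = dual_rate_step(x_slow, x_fast, f_slow, f_fast, H=0.01, n_sub=20)
print(f"x_slow = {x_slow:.4f}, x_fast = {x_fast:.4f}")
```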

An Efficient Data Distribution Method on a Distributed Shared Memory Machine (분산공유 메모리 시스템 상에서의 효율적인 자료분산 방법)

  • Min, Ok-Gee
    • The Transactions of the Korea Information Processing Society / v.3 no.6 / pp.1433-1442 / 1996
  • Data distribution in the SPMD (Single Program Multiple Data) pattern is one of the main features of HPF (High Performance Fortran). This paper describes design issues for such data distribution and an efficient execution model for it on the TICOM IV computer, named SPAX (Scalable Parallel Architecture computer based on X-bar network). SPAX has a hierarchical clustering structure that uses distributed shared memory (DSM). In such a memory structure, applying either SMDD (Shared Memory Data Distribution) or DMDD (Distributed Memory Data Distribution) uniformly cannot achieve full system utilization. Here we propose another data distribution model, called DSMDD (Distributed Shared Memory Data Distribution), based on a hierarchical master-slave scheme. In this model, a remote master and slaves are designated in each node; a shared-address scheme is used within a node and a message-passing scheme between nodes. In our simulation, assuming a node size that minimizes system performance degradation, DSMDD is more effective than SMDD and DMDD. In particular, the larger the number of logical processors and the weaker the data dependency among the distributed data, the better the performance obtained. A small sketch of the two-level decomposition is given below.

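The DSMDD model above designates a master in each node, shares addresses within a node, and uses message passing only between nodes. The sketch below shows the two-level decomposition this implies for a 1-D array; the node count, processors per node, and block sizes are illustrative assumptions, not the paper's scheme.

```python
# Two-level data distribution sketch for a hierarchical DSM machine:
# the array is first split across nodes (message passing between node masters),
# then each node's slice is shared among that node's processors (shared addressing).

def dsmdd_layout(n_elems, n_nodes, procs_per_node):
    """Return {(node, proc): (start, end)} index ranges of a 1-D array."""
    layout = {}
    node_chunk = (n_elems + n_nodes - 1) // n_nodes
    for node in range(n_nodes):
        node_lo = node * node_chunk
        node_hi = min(node_lo + node_chunk, n_elems)     # this slice lives in the node's shared memory
        local = node_hi - node_lo
        proc_chunk = (local + procs_per_node - 1) // procs_per_node
        for proc in range(procs_per_node):
            lo = node_lo + proc * proc_chunk
            hi = min(lo + proc_chunk, node_hi)
            layout[(node, proc)] = (lo, hi)
    return layout

layout = dsmdd_layout(n_elems=1000, n_nodes=4, procs_per_node=8)
# Elements 0..249 sit in node 0's shared memory and are addressed directly by its
# 8 processors, while an element such as 600 must be fetched from node 2 by message passing.
print(layout[(0, 0)], layout[(2, 3)])
```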

Design of Memory Test Circuit for Sliding Diagonal Patterns (Sliding diagonal Pattern에 의한 Memory Test circuit 설계)

  • 김대환;설병수;김대용;유영갑
    • Journal of the Korean Institute of Telematics and Electronics A / v.30A no.1 / pp.8-15 / 1993
  • A concrete design of a memory test circuit is presented, aiming at the application of sliding diagonal test patterns. A modification of the sliding diagonal test pattern reduces the complexity from O(n^{3/2}) to O(n) using a parallel test memory concept. The control circuit design is based on delay elements and was verified via logic and circuit simulation. The area overhead was evaluated from a physical layout using a 0.7-micron design rule, resulting in an area increase of about 1% for a typical 16 Mbit DRAM. A small software model of the pattern is given below.

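To illustrate the entry above, a small software model of a sliding diagonal test pattern follows: a diagonal of complemented cells is written on a uniform background, checked, and slid across the array. The array size, background value, and fault model are assumptions for illustration, not the paper's circuit.

```python
# Software model of a sliding diagonal memory test pattern on an N x N cell array:
# a diagonal of complemented cells is written on an all-zero background, checked,
# then slid one column to the right and checked again, for N iterations.
import numpy as np

N = 8                                   # illustrative array size (a real DRAM mat is far larger)

def sliding_diagonal_patterns(n):
    """Yield the n successive diagonal patterns (1 = complemented cell)."""
    base = np.zeros((n, n), dtype=np.uint8)
    for shift in range(n):
        pattern = base.copy()
        rows = np.arange(n)
        pattern[rows, (rows + shift) % n] = 1   # diagonal slid right by `shift`
        yield shift, pattern

faulty_cell = (3, 5)                    # pretend this cell is stuck at 0
for shift, pattern in sliding_diagonal_patterns(N):
    observed = pattern.copy()
    observed[faulty_cell] = 0           # fault injection: the cell never stores a 1
    # With a row-parallel test memory, all cells of a row are compared at once;
    # this kind of parallel check is what allows the overall test cost to fall
    # from O(n^{3/2}) to O(n) in the number of cells n.
    bad_rows = np.where((observed != pattern).any(axis=1))[0]
    if bad_rows.size:
        print(f"shift {shift}: mismatch detected in row(s) {bad_rows.tolist()}")
```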

Design of a Single-chip Multiprocessor with On-chip Learning for Large Scale Neural Network Simulation (대규모 신경망 시뮬레이션을 위한 칩상 학습가능한 단일칩 다중 프로세서의 구현)

  • 김종문;송윤선;김명원
    • Journal of the Korean Institute of Telematics and Electronics B / v.33B no.2 / pp.149-158 / 1996
  • In this paper we describe the design and implementation of a digital neural chip and a parallel neural machine for simulating large-scale neural networks. The chip is a single-chip multiprocessor that has four digital neural processors (DNP-II) of the same architecture. Each DNP-II has program memory and data memory, and the chip operates as an MIMD (multiple-instruction, multiple-data) parallel processor. The DNP-II has an instruction set tailored to neural computation, which can be used to simulate various neural network models effectively, including on-chip learning. The DNP-II provides four-way data-driven communication, supporting the extensibility of parallel systems. The parallel neural machine consists of a host computer, processor boards, a buffer board, and an interface board. Each processor board consists of an 8*8 array of DNP-IIs (equivalently, 2*2 neural chips). Each processor board can be configured as a linear array, a 2-D mesh, or a 2-D torus. This flexibility supports efficient mapping of neural network models onto the parallel structure. The neural system achieves a maximum performance of 40 GCPS (giga connections per second) with 16 processor boards. A toy mapping sketch is given below.

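The mapping flexibility mentioned above (linear array, 2-D mesh, 2-D torus) can be illustrated with a toy block-partitioning of a fully connected layer over an 8*8 mesh of processors. The layer size and the mapping rule below are assumptions for illustration, not the paper's scheme.

```python
# Toy mapping of a fully connected layer onto an 8 x 8 mesh of neural processors:
# output neurons are block-partitioned over the mesh so each processor owns a
# contiguous slice of neurons (and the corresponding rows of the weight matrix).

MESH_ROWS, MESH_COLS = 8, 8           # one processor board in the entry's description

def map_layer(n_outputs, mesh_rows=MESH_ROWS, mesh_cols=MESH_COLS):
    """Return {(r, c): (first_neuron, last_neuron_exclusive)} for each processor."""
    n_procs = mesh_rows * mesh_cols
    chunk = (n_outputs + n_procs - 1) // n_procs
    mapping = {}
    for idx in range(n_procs):
        r, c = divmod(idx, mesh_cols)
        lo = min(idx * chunk, n_outputs)          # last processors may get empty slices
        hi = min(lo + chunk, n_outputs)
        mapping[(r, c)] = (lo, hi)
    return mapping

mapping = map_layer(n_outputs=1000)
# Processor (0, 0) evaluates neurons 0..15; neighbours exchange activations over
# the mesh's data-driven communication links during the forward pass.
print(mapping[(0, 0)], mapping[(7, 7)])
```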

Parallel BCH Encoding/decoding Method and VLSI Design for Nonvolatile Memory (비휘발성 메모리를 위한 병렬 BCH 인코딩/디코딩 방법 및 VLSI 설계)

  • Lee, Sang-Hyuk;Baek, Kwang-Hyun
    • Journal of the Institute of Electronics Engineers of Korea SD / v.47 no.5 / pp.41-47 / 2010
  • This paper proposes a parallel BCH code, one of the error correction coding methods used with the NAND flash memory in SSDs (solid-state disks). By allowing the error correction capability to be altered, the proposed design improves the reliability of data blocks whose error rates rise as they are used more frequently. The parallel-processing bit width of the decoder is twice that of the encoder, which shortens the decoding time, resulting in roughly a one-half reduction compared with a conventional ECC.
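
The speed claim above rests on simple throughput arithmetic: if the decoder consumes twice as many bits per clock as the encoder, it needs roughly half as many cycles per page. A tiny cycle-count sketch follows; the page size, bit widths, and cycle model are illustrative assumptions, not the paper's VLSI figures.

```python
# Rough cycle-count model for parallel BCH over one NAND page:
# bits are streamed w bits per clock, so cycles ~= ceil(page_bits / w).

PAGE_BITS = 2048 * 8                            # hypothetical 2 KB NAND page

def cycles(page_bits, bits_per_cycle):
    return -(-page_bits // bits_per_cycle)      # ceiling division

enc_width = 8                                   # encoder processes 8 bits per cycle (assumed)
dec_width = 2 * enc_width                       # decoder width is doubled, as in the abstract

enc_cycles = cycles(PAGE_BITS, enc_width)
dec_cycles = cycles(PAGE_BITS, dec_width)
print(f"encode: {enc_cycles} cycles, decode: {dec_cycles} cycles "
      f"(~{enc_cycles / dec_cycles:.1f}x fewer for decode)")
```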

CUDA based parallel design of a shot change detection algorithm using frame segmentation and object movement

  • Kim, Seung-Hyun;Lee, Joon-Goo;Hwang, Doo-Sung
    • Journal of the Korea Society of Computer and Information / v.20 no.7 / pp.9-16 / 2015
  • This paper proposes a parallel design of a shot change detection algorithm using frame segmentation and moving blocks. In the proposed approach, the highly parallel components, such as the frame histogram calculation, block histogram calculation, Otsu threshold setting, frame moving operation, and block histogram comparison, are designed in parallel for NVIDIA GPUs. In order to minimize memory access delay and guarantee fast computation, the output of one GPU kernel becomes the input of another kernel in a pipelined manner using the GPU's shared memory. In addition, the optimal sizes of the CUDA processing blocks and threads are estimated through preliminary experiments. In the experimental tests of the proposed shot change detection algorithm, the detection rate of the GPU-based parallel algorithm is the same as that of the CPU-based algorithm, while the average processing time is about 6~8 times faster.
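
The processing chain above (frame histograms, histogram comparison, Otsu-based threshold) can be sketched serially with NumPy, as below. The synthetic frames, the use of a global rather than block-wise histogram, and the 256-bin layout are assumptions; the paper implements each stage as a CUDA kernel chained through GPU shared memory.

```python
# Serial sketch of histogram-based shot change detection with an Otsu threshold.
import numpy as np

def frame_histogram(frame, bins=256):
    hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return hist / hist.sum()                      # normalized gray-level histogram

def otsu_threshold(scores, bins=64):
    """Otsu's method applied to the 1-D array of frame-difference scores."""
    hist, edges = np.histogram(scores, bins=bins)
    prob = hist / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2
    best_t, best_var = edges[0], -1.0
    for i in range(1, bins):
        w0, w1 = prob[:i].sum(), prob[i:].sum()
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (prob[:i] * centers[:i]).sum() / w0
        mu1 = (prob[i:] * centers[i:]).sum() / w1
        between = w0 * w1 * (mu0 - mu1) ** 2       # between-class variance
        if between > best_var:
            best_var, best_t = between, edges[i]
    return best_t

def detect_shot_changes(frames):
    hists = [frame_histogram(f) for f in frames]               # histogram kernel in the paper
    diffs = np.array([np.abs(hists[i] - hists[i - 1]).sum()    # histogram comparison kernel
                      for i in range(1, len(frames))])
    t = otsu_threshold(diffs)                                  # threshold-setting kernel
    return [i + 1 for i, d in enumerate(diffs) if d > t]

# Synthetic clip: two "shots" of different brightness with a cut at frame 30.
rng = np.random.default_rng(0)
shot_a = rng.integers(0, 100, size=(30, 120, 160))
shot_b = rng.integers(150, 256, size=(30, 120, 160))
frames = np.concatenate([shot_a, shot_b])
print("shot changes at frames:", detect_shot_changes(frames))
```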