Search | Korea Science

Fast and Efficient Implementation of Neural Networks using CUDA and OpenMP (CUDA와 OPenMP를 이용한 빠르고 효율적인 신경망 구현)

Park, An-Jin;Jang, Hong-Hoon;Jung, Kee-Chul
- Journal of KIISE:Software and Applications / v.36 no.4 / pp.253-260 / 2009
Many algorithms for computer vision and pattern recognition have recently been implemented on GPUs (graphics processing units) for faster computation. However, such implementations have two problems. First, the programmer must master the fundamentals of graphics shading languages, which requires prior knowledge of computer graphics. Second, in a job that needs close cooperation between the CPU and GPU, which is common in image processing and pattern recognition but not in graphics, the CPU should generate as much raw feature data for GPU processing as possible to utilize GPU performance effectively. This paper proposes a faster and more efficient implementation of neural networks on both a GPU and a multi-core CPU. To solve the first problem, we use CUDA (Compute Unified Device Architecture), which is easy to program because of its simple C-like style, instead of conventional shader-based GPU programming. Moreover, OpenMP (Open Multi-Processing) is used to process multiple data concurrently with a single instruction on the multi-core CPU, which makes effective use of GPU memory. In the experiments, we implemented a neural network-based text extraction system using the proposed architecture, and it ran about 15 times faster than an implementation on the GPU alone without OpenMP.

Studies of Parallelism and Performance Enhancements of Computing View Factor for Satellite Thermal Analysis (인공위성 열해석을 위한 복사형상계수 계산기법의 병렬화 및 성능향상 기법 연구)

Kim, Min-Ki
- Journal of the Korean Society for Aeronautical & Space Sciences / v.43 no.12 / pp.1079-1088 / 2015
This paper introduces the parallelization and performance enhancement of view factor calculation in KSDS, developed by KARI. The view factor is an essential parameter in radiation thermal analysis of a spacecraft, and the amount of computation it requires is not negligible. In particular, view factors must be integrated independently at each orbital position because the relative displacement between the solar panel and the main body of a satellite varies with position along the orbit. For satellite thermal analysis, this paper presents several ways of parallelizing view factor computation and their performance, obstruction detection by a spatial search algorithm based on a KD-tree, and a technique called updating the fractional view factor matrix, which reduces view factor calculation for a satellite with relative motion between the solar panel and the main body.
https://doi.org/10.5139/JKSAS.2015.43.12.1079

The 3-Dimensional Visualization in Shared-Memory Programs with Nested Parallelism (내포 병렬성을 가진 공유메모리 프로그램의 3차원 시각화)

Park, Myeong-Chul;Hur, Hwa-Ra;Ha, Seok-Wun
- Journal of the Korea Institute of Information and Communication Engineering / v.12 no.1 / pp.53-58 / 2008
A parallel program with nested parallelism can produce non-deterministic results when its threads execute concurrently without synchronization. Various visualization techniques are used to detect such errors, but their intuitiveness suffers from limited space and excessive abstraction. This paper proposes a 3-D visualization engine that presents the user with the global structure of a complicated parallel program with nested parallelism. The proposed engine presents this global structure in an easily understood form and thereby provides an effective debugging environment.
https://doi.org/10.6109/JKIICE.2008.12.1.53

The Effect of Mesh Reordering on Laplacian Smoothing for Nonuniform Memory Access Architecture-based High Performance Computing Systems (NUMA구조를 가진 고성능 컴퓨팅 시스템에서의 메쉬 재배열의 라플라시안 스무딩에 대한 효과)

Kim, Jbium
- Journal of the Institute of Electronics and Information Engineers / v.51 no.3 / pp.82-88 / 2014
We study the effect of mesh reordering on Laplacian smoothing for parallel high-performance computing systems. Specifically, we use the Reverse Cuthill-McKee algorithm to reorder meshes and Laplacian smoothing to improve mesh quality on nonuniform memory access (NUMA) architecture-based parallel high-performance computing systems. First, we investigate the effect of mesh reordering on Laplacian smoothing for a single-core system, and then we extend the idea to NUMA-based high-performance computing systems.
https://doi.org/10.5573/ieie.2014.51.3.082
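
For readers unfamiliar with the technique named in this abstract, the following is a minimal sketch of Laplacian smoothing, in which each free vertex is moved to the average of its neighbors. The function name `laplacian_smooth`, the adjacency-list layout, and the tiny test mesh are illustrative assumptions and are not taken from the paper. Because each update reads the coordinates of neighboring vertices, the order in which vertices are stored affects memory locality, which is presumably why mesh reordering matters here.

```python
def laplacian_smooth(coords, neighbors, fixed, iterations=1):
    """Move each free vertex to the average of its neighbors (Jacobi-style).

    coords    : list of coordinate tuples, one per vertex
    neighbors : neighbors[v] is the list of vertex indices adjacent to v
    fixed     : set of vertex indices that must not move (e.g. boundary)
    """
    pts = [tuple(p) for p in coords]
    for _ in range(iterations):
        new_pts = list(pts)  # read old positions, write new ones
        for v, nbrs in enumerate(neighbors):
            if v in fixed or not nbrs:
                continue
            dim = len(pts[v])
            new_pts[v] = tuple(
                sum(pts[n][d] for n in nbrs) / len(nbrs) for d in range(dim)
            )
        pts = new_pts
    return pts


# Hypothetical mesh: a unit square with one off-center interior vertex (index 4).
coords = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.8, 0.3)]
neighbors = [[1, 3, 4], [0, 2, 4], [1, 3, 4], [0, 2, 4], [0, 1, 2, 3]]
smoothed = laplacian_smooth(coords, neighbors, fixed={0, 1, 2, 3})
```

In this example the interior vertex moves to the centroid of the four corners, while the fixed boundary vertices stay in place.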

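Parallel Calculation of Deposition Thickness Distribution in an In-line Magnetron Sputtering System (인라인 타입 마그네트론 스퍼터링 장치에서 증착 두께 분포 병렬 계산)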

Ju, Jeong-Hun
- Proceedings of the Korean Vacuum Society Conference / 2014.02a / pp.225-225 / 2014
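Calculating the deposition thickness distribution with the general cosine law takes the form of an integral. For LCD 8G-class equipment, a magnetron sputtering target measures about 3 m in depth and 25 cm in width, and roughly 6 to 8 targets are installed to shorten the process time. The angle between the target surface and a substrate moving in one direction changes from a very small angle to perpendicular and then to a negative angle. The resulting changes in thin-film microstructure strongly affect the film properties. As a first step toward studying this, both the target surface and the substrate surface were divided into small area elements, and the flight of deposited atoms between each pair of elements was handled with a direct flux algorithm under the assumption of no collisions. The required computation time is very long, taking several hours when run as a serial job on a single-core CPU. As an alternative, the work was parallelized using OpenMP. A maximum parallel efficiency of 96% was achieved on a 4-core machine.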
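
To put the reported efficiency figure in context, the short sketch below applies the usual definitions of speedup (serial time divided by parallel time) and parallel efficiency (speedup divided by core count). The function names are illustrative; only the 96% efficiency and the 4 cores come from the abstract, and no timings from the paper are assumed.

```python
def speedup(t_serial, t_parallel):
    """Speedup: how many times faster the parallel run is."""
    return t_serial / t_parallel


def parallel_efficiency(t_serial, t_parallel, n_cores):
    """Parallel efficiency: speedup per core (1.0 means ideal scaling)."""
    return speedup(t_serial, t_parallel) / n_cores


def speedup_from_efficiency(efficiency, n_cores):
    """Invert the definition: speedup implied by a given efficiency."""
    return efficiency * n_cores


# Values from the abstract: 96% efficiency on a 4-core machine.
implied_speedup = speedup_from_efficiency(0.96, 4)
```

Under these definitions, 96% efficiency on 4 cores corresponds to a speedup of about 3.84 over the serial run.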

Parallelizing 3D Frequency-domain Acoustic Wave Propagation Modeling using a Xeon Phi Coprocessor (제온 파이 보조 프로세서를 이용한 3차원 주파수 영역 음향파 파동 전파 모델링 병렬화)

Ryu, Donghyun;Jo, Sang Hoon;Ha, Wansoo
- Geophysics and Geophysical Exploration / v.20 no.3 / pp.129-136 / 2017
3D seismic data processing methods such as full waveform inversion or reverse-time migration require 3D wave propagation modeling and heavy computation. We compared the efficiency and accuracy of a Xeon Phi coprocessor with those of a high-end server CPU using 3D frequency-domain wave propagation modeling. We applied OpenMP parallel programming to the time-domain finite-difference algorithm, taking into account the characteristics of Xeon Phi coprocessors. We applied the Fourier transform using a running integration to obtain the frequency-domain wavefield. A numerical test on frequency-domain wavefield modeling was performed using the 3D SEG/EAGE salt velocity model. As a result, we obtained an accurate frequency-domain wavefield and attained a 1.44x speedup with the Xeon Phi coprocessor over the CPU.
https://doi.org/10.7582/GGE.2017.20.3.129
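
The idea of a Fourier transform by running integration can be sketched as follows: rather than storing the whole time-domain wavefield and transforming it afterward, each new time sample is multiplied by a complex exponential and added to an accumulator for every frequency of interest. This is a minimal single-trace sketch, not the paper's implementation. The names `running_fourier` and `cosine_trace`, the sign convention exp(-iωt), and the synthetic cosine input are all assumptions made for illustration.

```python
import cmath
import math


def running_fourier(samples, dt, omegas):
    """Accumulate U(w) = sum_n u(t_n) * exp(-i w t_n) * dt as samples arrive.

    samples : iterable of real samples u(t_n), consumed one time step at a time
    dt      : time step in seconds
    omegas  : angular frequencies (rad/s) at which to evaluate the transform
    """
    acc = [0j] * len(omegas)
    for n, u in enumerate(samples):
        t = n * dt
        for k, w in enumerate(omegas):
            acc[k] += u * cmath.exp(-1j * w * t) * dt
    return acc


def cosine_trace(freq_hz, dt, n_steps):
    """Stand-in for a time-stepping solver: yields one sample per step."""
    for n in range(n_steps):
        yield math.cos(2.0 * math.pi * freq_hz * n * dt)


# Hypothetical input: a 5 Hz cosine recorded for 1 s at 1 ms steps.
dt = 0.001
omegas = [2.0 * math.pi * 5.0, 2.0 * math.pi * 10.0]
spectrum = running_fourier(cosine_trace(5.0, dt, 1000), dt, omegas)
```

Here `spectrum[0]` is close to 0.5 (half the record length, as expected for a unit cosine at its own frequency) and `spectrum[1]` is close to zero. The time series is never stored, which is the main appeal of the approach for large 3D grids.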

Benchmarking on High-speed Image Processing Techniques based on Multi-processor (멀티프로세서 기반의 고속 영상처리 기술에 대한 벤치마킹)

Cui, Xue-Nan;Park, Eun-Soo;Kim, Jun-Chul;Kim, Hak-Il
- Proceedings of the KIEE Conference / 2007.10a / pp.111-112 / 2007
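This paper introduces methods for developing high-speed image processing algorithms based on multiprocessors. With advances in image acquisition, high-resolution images have become obtainable and images have moved to color, so many image processing applications now require faster algorithms. To meet this demand, the top priority is to develop algorithms that make full use of recently released multiprocessors. This paper introduces algorithm development methods that use parallel processing and high-speed image processing libraries such as OpenMP, MIL (Matrox Image Library), OpenCV, IPP (Integrated Performance Primitives), and SSE (Streaming SIMD (Single Instruction Multiple Data) Extensions), and analyzes and evaluates the algorithm performance of each method. In the experiments, the Mean, Dilation, Erosion, Open, Closing, and Sobel algorithms implemented with SSE, IPP, and MIL (Thread) achieved good performance of $7{\sim}35$ msec on a $4057{\times}4048$ image, outperforming the other approaches.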

Parallel processing and GPU-accelerated processing of UHD sequence using HEVC (HEVC를 이용한 UHD 영상의 CPU 병렬처리 및 GPU 가속처리)

Hong, Sung-Wook;Lee, Yung-Lyul
- Proceedings of the Korean Society of Broadcast Engineers Conference / 2013.06a / pp.409-410 / 2013
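HEVC (High Efficiency Video Coding), a video compression technology, is nearing the completion of its standardization, carried out jointly by ITU-T (VCEG) and ISO-IEC (MPEG) as a team called JCT-VC. It improves performance by about 50% over the previous standard, but because it uses a variety of recent compression techniques, its encoding and decoding complexity is very high. The proposed approach analyzes real-time encoding and decoding of high-quality video by applying an OpenMP-based parallel structure to slice-level processing and by applying a GPU acceleration model.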

Data Level Parallelism for H.264/AVC Decoder on a Multi-Core Processor and Performance Analysis (멀티코어 프로세서에서의 H.264/AVC 디코더를 위한 데이터 레벨 병렬화 성능 예측 및 분석)

Cho, Han-Wook;Jo, Song-Hyun;Song, Yong-Ho
- Journal of the Institute of Electronics Engineers of Korea SD / v.46 no.8 / pp.102-116 / 2009
There has been much research on enhancing H.264/AVC performance on multi-core processors, and the enhancement has been achieved through parallelization. Parallelization methods can be classified into task-level and data-level methods. A task-level parallelization method for an H.264/AVC decoder divides the decoder algorithms into pipeline stages. However, it is not suitable for complex and large bitstreams because of poor load balancing. Considering load balancing and performance scalability, we propose a horizontal data-level parallelization method for the H.264/AVC decoder in which threads are assigned to macroblock lines. We develop a mathematical performance expectation model for the proposed parallelization method. To evaluate the model, we measured performance with the JM 13.2 reference software on an ARM11 MPCore Evaluation Board. Cycle-accurate measurement with the SoCDesigner Co-verification Environment showed that the expected performance and performance scalability of the proposed parallelization method were accurate to a relatively high degree.

Multi-Threaded Parallel H.264/AVC Decoder for Multi-Core Systems (멀티코어 시스템을 위한 멀티스레드 H.264/AVC 병렬 디코더)

Kim, Won-Jin;Cho, Keol;Chung, Ki-Seok
- Journal of the Institute of Electronics Engineers of Korea SD / v.47 no.11 / pp.43-53 / 2010
The wide deployment of high-resolution video services has led to active studies on high-speed video processing. In particular, the prevalence of multi-core systems is accelerating research on high-resolution video processing based on the parallelization of multimedia software. In this paper, we propose a novel parallel H.264/AVC decoding scheme for a multi-core platform. Parallel H.264/AVC decoding is challenging not only because parallelization may incur significant synchronization overhead but also because the software may have complicated dependencies. To overcome these issues, we propose a novel approach called Multi-Threaded Parallelization (MTP). In MTP, a separate thread is allocated to each stage in the pipeline to reduce synchronization overhead. In addition, an efficient memory reuse technique is used to reduce the memory requirement. To verify the effectiveness of the proposed approach, we parallelized the FFmpeg H.264/AVC decoder with the proposed technique using OpenMP and carried out experiments on an Intel quad-core platform. The proposed design performs 53% better than the FFmpeg H.264/AVC decoder before parallelization. We also reduced memory usage by 65% and 81% for high-definition (HD) and full high-definition (FHD) video, respectively, compared with a popular existing method called 2Dwave.
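
The abstract names 2Dwave as its baseline without describing it. As a rough illustration of that kind of wavefront scheduling, the sketch below computes the earliest step at which each macroblock could be decoded, assuming each macroblock depends on its left, top-left, top, and top-right neighbors. That dependency pattern is the one commonly described for H.264 intra prediction and is an assumption here, not something stated in the abstract. The function name `wavefront_schedule` and the 3x4 grid are hypothetical.

```python
from collections import Counter


def wavefront_schedule(rows, cols):
    """Earliest decode step for each macroblock under wavefront dependencies.

    Assumed dependencies of macroblock (r, c): left, top-left, top, top-right.
    """
    step = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            deps = []
            if c > 0:
                deps.append(step[r][c - 1])          # left
            if r > 0:
                deps.append(step[r - 1][c])          # top
                if c > 0:
                    deps.append(step[r - 1][c - 1])  # top-left
                if c + 1 < cols:
                    deps.append(step[r - 1][c + 1])  # top-right
            step[r][c] = max(deps) + 1 if deps else 0
    return step


# Hypothetical frame of 3 macroblock rows by 4 macroblock columns.
schedule = wavefront_schedule(3, 4)

# Number of macroblocks that can be decoded concurrently at each step.
per_step = Counter(s for row in schedule for s in row)
max_parallel = max(per_step.values())
```

With these dependencies, each macroblock row starts two steps after the row above it, so the macroblock at row `r` and column `c` is ready at step `c + 2 * r`. Parallelism ramps up and down along the diagonal wavefront, which is one reason synchronization overhead is a concern for this family of methods.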

