Search | Korea Science

A Study on a Declines in Performance by Memory Copy in CUDA (CUDA의 메모리 복사로 인한 성능 저하 연구)

Kang, Jihun;Lee, DaeWon;Kang, InSung;Yu, HeonChang
- Proceedings of the Korea Information Processing Society Conference / 2013.11a / pp.135-138 / 2013
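CUDA (Compute Unified Device Architecture), a GPGPU (General Purpose Graphics Processing Unit) parallel processing system, has been widely used for high-speed computation. Using CUDA for computation requires an understanding of its characteristics. CUDA has a Host region, processed by the CPU (Central Processing Unit), and a Device region, processed by the GPU (Graphics Processing Unit), and computation proceeds by copying data between the two. Because of this structure, input data must be transferred from main memory to GPU memory before the GPU can operate on it. As a result, data copy time between main memory and GPU memory exists separately from computation time, and this additional memory copy time causes overhead. In this paper, we discuss through experiments how memory copy time, the number of repetitions of a computation, and the complexity of the computation affect overall performance.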
https://doi.org/10.3745/PKIPS.y2013m11a.135
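
The trade-off this abstract describes can be illustrated with a simple cost model. The sketch below is not the paper's experiment: it assumes the input is copied to the GPU once, the kernel then runs a given number of times, and the result is copied back once. The function names and the timings are hypothetical, chosen only to show how the copy share of total time shrinks as the kernel is repeated.

```python
def total_time(copy_in_s, kernel_s, copy_out_s, iterations):
    """Wall time for one host-to-device copy, `iterations` kernel runs,
    and one device-to-host copy (assumed model, no overlap)."""
    return copy_in_s + iterations * kernel_s + copy_out_s


def copy_overhead_fraction(copy_in_s, kernel_s, copy_out_s, iterations):
    """Share of the total time spent on memory copies rather than compute."""
    total = total_time(copy_in_s, kernel_s, copy_out_s, iterations)
    return (copy_in_s + copy_out_s) / total


if __name__ == "__main__":
    # Hypothetical timings in seconds, not measurements from the paper.
    COPY_IN, KERNEL, COPY_OUT = 0.004, 0.001, 0.004
    for n in (1, 10, 100):
        share = copy_overhead_fraction(COPY_IN, KERNEL, COPY_OUT, n)
        print(f"iterations={n:4d}  copy share={share:.3f}")
```

Under this model a heavier kernel or more repetitions amortizes the fixed copy cost, which matches the three factors the abstract names.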

Implementation of H.264/AVC Deblocking Filter on 1-D CGRA (1-D CGRA에서의 H.264/AVC 디블록킹 필터 구현)

Song, Sehyun;Kim, Kichul
- Journal of IKEEE / v.17 no.4 / pp.418-427 / 2013
In this paper, we propose a parallel deblocking filter algorithm for the H.264/AVC video standard. The deblocking filter applies different filter processes depending on boundary strength (BS), and each filter process requires various conditional calculations. The order of filtering makes it difficult to parallelize deblocking filter calculations. The proposed deblocking filter algorithm runs on PRAGRAM, a 1-D coarse-grained reconfigurable architecture (CGRA). Each filter calculation is accelerated using the unidirectional pipelined architecture of PRAGRAM. Filter selection and the conditional calculations are performed efficiently using dynamic reconfiguration and conditional reconfiguration. The parallel deblocking filter algorithm uses 225 cycles to process a macroblock, and it can process a full HD image at 150 MHz.
https://doi.org/10.7471/ikeee.2013.17.4.418
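
The throughput figures in this abstract can be sanity-checked with simple arithmetic. The sketch below assumes 16x16 macroblocks, a 1920x1080 frame, and that the 225 cycles per macroblock are the only cost (one macroblock at a time, no other overhead). The function names are illustrative, and the resulting frame rate is derived here rather than stated in the abstract.

```python
import math


def macroblocks_per_frame(width, height, mb_size=16):
    """Number of macroblocks covering the frame, rounding partial blocks up."""
    return math.ceil(width / mb_size) * math.ceil(height / mb_size)


def max_frames_per_second(clock_hz, cycles_per_mb, width, height):
    """Upper bound on frame rate if deblocking is the only per-frame cost."""
    cycles_per_frame = cycles_per_mb * macroblocks_per_frame(width, height)
    return clock_hz / cycles_per_frame


if __name__ == "__main__":
    mbs = macroblocks_per_frame(1920, 1080)
    fps = max_frames_per_second(150e6, 225, 1920, 1080)
    print(f"{mbs} macroblocks per frame, at most {fps:.1f} frames per second")
```

Under these assumptions the cycle budget leaves room for real-time full HD rates, which is consistent with the abstract's claim.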

An Application-Level Fault Tolerant System For Synchronous Parallel Linear System Solver (선형 시스템의 동기 병렬 연산을 위한 응용 수준의 무정지 연산 시스템)

Park, Pil-Seong
- Proceedings of the Korea Information Processing Society Conference / 2007.11a / pp.644-647 / 2007
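Very large computations that run for a long time on many CPUs often fail because of faults in some nodes or communication links, so implementing an application-level fault-tolerant system is important. In this paper, we address the weaknesses of a previous system that used an asynchronous algorithm, propose a new application-level fault-tolerant system that is also applicable to synchronous algorithms, and apply it to solving linear systems.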
https://doi.org/10.3745/PKIPS.y2007m11a.644

A Method to Process Spatial Information in Parallel Spatial DBMS (병렬 공간데이터베이스 시스템에서 공간 정보 처리 방안)

Kim, JinDeog
- Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2016.05a / pp.811-812 / 2016
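Recently, spatial information has become difficult to handle in conventional spatial database systems because of the volume produced, the frequency of data generation, and its diversity. Attempts to link spatial information with big data are therefore actively under way. However, there has been little research on efficient index-based spatial operations using single assignment or multiple assignment. This paper presents factors to consider when processing spatial joins, which are among the most expensive spatial operations, in a big data system. Specifically, it describes a single-assignment spatial indexing scheme for task assignment in a MapReduce system and presents factors to consider for load balancing given the highly skewed distribution of spatial information. It also describes how two problems in parallel spatial database systems such as MapReduce, skewed data distribution and boundary-overlap indexing, relate to each other.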

Low Power Parallel Acquisition Scheme for UWB Systems (저전력 병렬탐색기법을 이용한 UWB시스템의 동기 획득)

Kim, Sang-In;Cho, Kyoung-Rok
- The Journal of the Korea Contents Association / v.7 no.1 / pp.147-154 / 2007
In this paper, we propose a new parallel search algorithm for synchronization acquisition in UWB (Ultra Wideband) systems that reduces the amount of correlation computation. Conventional synchronization acquisition algorithms check all possible signal phases simultaneously using multiple correlators. Although this reduces acquisition time, it raises power consumption because of the increased number of correlations. The proposed algorithm divides the preamble signal fed to the correlator into m-bit bunches. We check the result of the correlation at the first stage, on an m-bit bunch of data, and predict whether it carries synchronization acquisition information. This eliminates unnecessary operations and reduces the number of correlations. We evaluate the proposed algorithm under AWGN and multipath channel models using MATLAB. The proposed parallel search scheme reduces the number of correlations by 65% on the AWGN channel and by 20% on the multipath fading channel.
https://doi.org/10.5392/JKCA.2007.7.1.147

Thread-Level Parallelism using Java Thread and Network Resources (자바 스레드와 네트워크 자원을 이용한 병렬처리)

Kim, Tae-Yong
- Journal of Advanced Navigation Technology / v.14 no.6 / pp.984-989 / 2010
In this paper, a parallel programming technique using Java threads is introduced in order to develop a parallel design tool for analyzing a small micro flow sensor. To estimate computing time for thread-level parallelism, the performance of two experimental models for a potential problem governed by the thermal transfer equation is examined. The results show that as the number of networked PCs increases, computing time in the network environment improves by a factor of almost n. The micro sensor design tool based on distributed computing can be used to analyze large-scale problems.

A Design and Implementation of a Java Parallel Processing System based on the WWW and Its Performance Improvement Schemes (WWW기반 자바 병렬 처리 시스템의 설계 및 구현과 성능 향상 기법)

한연희;박찬열;정영식;황종선
- Proceedings of the Korean Information Science Society Conference / 1998.10a / pp.715-717 / 1998
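With the rapid growth of the Internet, there are active attempts to exploit the resources of many network-connected hosts. This paper describes the design and implementation of a WWW-based Java parallel system built on a client, parallel processing server, and worker configuration: worker applets are distributed to arbitrary hosts, a computation-heavy job is divided among them and executed, and the results are shown to the client. To improve performance, we implement an inter-applet communication mechanism using Java Remote Method Invocation so that workers send their results directly to the client without passing through the server. We also shorten job time by analyzing the performance ratio of each worker and assigning tasks accordingly. We distribute a computation-intensive fractal image processing job on this system, measure performance as a function of task size and of job distribution method, and compare and present the results.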
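
Assigning tasks according to each worker's performance ratio can be sketched as a proportional split. The abstract does not give the paper's allocation rule, so the code below is only one plausible reading: each worker receives a share of tasks proportional to its measured relative speed, with leftover tasks going to the workers with the largest fractional remainders. The function name `allocate_tasks` and the example ratios are hypothetical.

```python
def allocate_tasks(num_tasks, performance):
    """Split `num_tasks` among workers in proportion to `performance`.

    `performance` holds one positive relative-speed value per worker.
    Uses the largest-remainder method so the counts sum to `num_tasks`.
    """
    total = sum(performance)
    shares = [num_tasks * p / total for p in performance]
    counts = [int(s) for s in shares]  # whole tasks each worker gets first
    leftover = num_tasks - sum(counts)
    # Hand remaining tasks to the workers with the largest fractional share.
    by_remainder = sorted(
        range(len(shares)), key=lambda i: shares[i] - counts[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical example: the second and third workers are twice as fast.
    print(allocate_tasks(10, [1, 2, 2]))
```

A static split like this only helps if the measured ratios stay representative for the duration of the job.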

Parallel Processing of Structural Optimization Using PC Transputer System (PC 트랜스퓨터 시스템을 이용한 구조최적화의 병렬처리)

황진하;박종희
- Journal of the Computational Structural Engineering Institute of Korea / v.12 no.2 / pp.233-241 / 1999
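This study presents a parallel processing procedure for structural optimization on a transputer system, a loosely coupled MIMD parallel computer in which each node has its own memory, and verifies its validity and efficiency by applying it to test models. The analysis and sensitivity algorithms, which account for most of the overall optimization process, are based on substructuring with domain-level parallelism and are transformed and reorganized to match the hardware configuration. Communication between nodes is limited to static condensation and design derivatives; minimizing the number of communications and synchronizing them addresses communication cost, the weakness of the distributed-memory computing model. Numerical experiments with a PC as the host show encouraging results in terms of speedup efficiency. Taking the scalability of the system into account as well, parallel processing based on a transputer system can be a good alternative that meets the changes and demands of the engineering environment.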

Exploration of Optimization Environment for CUDA-based Cholesky Decomposition (CUDA 기반 숄레스키 분해 성능 최적화 환경 탐색)

Junbeom Kang;Myungho Lee;Neungsoo Park
- Proceedings of the Korea Information Processing Society Conference / 2024.05a / pp.15-17 / 2024
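Recently, various research fields have succeeded in reducing computation time through parallel processing with the CUDA framework. Cholesky decomposition, which factors a positive definite matrix into a lower triangular matrix, requires many matrix multiplications, so exploiting the structural characteristics of the GPU can yield substantial acceleration. In this paper, we therefore implemented a parallel Cholesky decomposition program in which the number of blocks and the number of threads per block, the key factors when assigning work to CUDA cores, can be adjusted. We measured the degree of acceleration for three different matrix sizes under various block-count and thread-count settings. Relative to the longest time within the same group, the best setting for each matrix achieved an additional speedup of about 1.80x for the 1000x1000 matrix and about 2.94x for the 2000x2000 matrix.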
https://doi.org/10.3745/PKIPS.y2024m05a.15
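
For readers unfamiliar with the factorization, a serial reference version is shown below. It is a plain CPU sketch of the textbook Cholesky-Banachiewicz recurrence, not the paper's CUDA program; the function name `cholesky` and the test matrix are illustrative. The inner dot products are the kind of work a GPU implementation would spread across blocks and threads.

```python
import math


def cholesky(a):
    """Return lower-triangular L with L * L^T == a.

    `a` is a symmetric positive definite matrix given as a list of rows.
    Raises ValueError if a non-positive pivot is encountered.
    """
    n = len(a)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            # Dot product of the already-computed parts of rows i and j.
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = a[i][i] - s
                if pivot <= 0:
                    raise ValueError("matrix is not positive definite")
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (a[i][j] - s) / lower[j][j]
    return lower


if __name__ == "__main__":
    matrix = [[4, 12, -16], [12, 37, -43], [-16, -43, 98]]
    for row in cholesky(matrix):
        print(row)
```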

Methodology and its Hardware Architecture for High-speed Parallel Computation of Computer Generated Hologram (컴퓨터 생성 홀로그램의 고속 병렬 연산을 위한 연산방식 및 하드웨어 구조)

Yang, Wol-Sung;Choi, Hyun-Jun;Seo, Young-Ho;Yoo, Ji-Sang;Kim, Dong-Wook
- Proceedings of the Korean Society of Broadcast Engineers Conference / 2010.11a / pp.30-33 / 2010
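In this paper, we address the slowdown caused by the heavy computation needed to generate a digital hologram by calculation (computer-generated hologram, CGH) by modifying the computation formula and implementing it in hardware. Compared with previously proposed CGH algorithms, the proposed algorithm allows fully parallel processing of the digital hologram and thus resolves the speed problem. The implementation results show that, given the hardware, all hologram components from one light source can be computed in at most 3 cycles, and that with pipelining, a hologram result for one light source is obtained every cycle after a latency of two cycles.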

Search Result 552, Processing Time 0.029 seconds
