• Title/Summary/Keyword: 병렬처리 알고리즘

Search Result 697, Processing Time 0.023 seconds

A CPU and GPU Heterogeneous Computing Techniques for Fast Representation of Thin Features in Liquid Simulations (액체 시뮬레이션의 얇은 특징을 빠르게 표현하기 위한 CPU와 GPU 이기종 컴퓨팅 기술)

  • Kim, Jong-Hyun
    • Journal of the Korea Computer Graphics Society
    • /
    • v.24 no.2
    • /
    • pp.11-20
    • /
    • 2018
  • We propose a new method particle-based method that explicitly preserves thin liquid sheets for animating liquids on CPU-GPU heterogeneous computing framework. Our primary contribution is a particle-based framework that splits at thin points and collapses at dense points to prevent the breakup of liquid on GPU. In contrast to existing surface tracking methods, the our method does not suffer from numerical diffusion or tangles, and robustly handles topology changes on CPU-GPU framework. The thin features are detected by examining stretches of distributions of neighboring particles by performing PCA(Principle component analysis), which is used to reconstruct thin surfaces with anisotropic kernels. The efficiency of the candidate position extraction process to calculate the position of the fluid particle was rapidly improved based on the CPU-GPU heterogeneous computing techniques. Proposed algorithm is intuitively implemented, easy to parallelize and capable of producing quickly detailed thin liquid animations.

Design and Implementation of a Low-Complexity and High-Throughput MIMO Symbol Detector Supporting up to 256 QAM (256 QAM까지 지원 가능한 저 복잡도 고 성능의 MIMO 심볼 검파기의 설계 및 구현)

  • Lee, Gwang-Ho;Kim, Tae-Hwan
    • Journal of the Institute of Electronics and Information Engineers
    • /
    • v.51 no.6
    • /
    • pp.34-42
    • /
    • 2014
  • This paper presents a low-complexity and high-throughput symbol detector for two-spatial-stream multiple-input multiple-output systems based on the modified maximum-likelihood symbol detection algorithm. In the proposed symbol detector, the cost function is calculated incrementally employing a multi-cycle architecture so as to eliminate the complex multiplications for each symbol, and the slicing operations are performed hierarchically according to the range of constellation points by a pipelined architecture. The proposed architecture exhibits low hardware complexity while supporting complicated modulations such as 256 QAM. In addition, various modulations and antenna configurations are supported flexibly by reconfiguring the pipeline for the slicing operation. The proposed symbol detector is implemented with 38.7K logic gates in a $0.11-{\mu}m$ CMOS process and its throughput is 166 Mbps for $2{\times}$3 16-QAM and 80Mbps for $2{\times}3$ 64-QAM where the operating frequency is 478 MHz.

Design and Implementation of an Approximate Surface Lens Array System based on OpenCL (OpenCL 기반 근사곡면 렌즈어레이 시스템의 설계 및 구현)

  • Kim, Do-Hyeong;Song, Min-Ho;Jung, Ji-Sung;Kwon, Ki-Chul;Kim, Nam;Kim, Kyung-Ah;Yoo, Kwan-Hee
    • The Journal of the Korea Contents Association
    • /
    • v.14 no.10
    • /
    • pp.1-9
    • /
    • 2014
  • Generally, integral image used for autostereoscopic 3d display is generated for flat lens array, but flat lens array cannot provide a wide range of view for generated integral image because of narrow range of view. To make up for this flat lens array's weak point, curved lens array has been proposed, and due to technical and cost problem, approximate surface lens array composed of several flat lens array is used instead of ideal curved lens array. In this paper, we constructed an approximate surface lens array arranged for $20{\times}8$ square flat lens in 100mm radius sphere, and we could get about twice angle of view compared to flat lens array. Specially, unlike existing researches which manually generate integral image, we propose an OpenCL GPU parallel process algorithm for generating real-time integral image. As a result, we could get 12-20 frame/sec speed about various 3D volume data from $15{\times}15$ approximate surface lens array.

Efficient Scheduling Schemes for Low-Area Mixed-radix MDC FFT Processor (저면적 Mixed-radix MDC FFT 프로세서를 위한 효율적인 스케줄링 기법)

  • Jang, Jeong Keun;Sunwoo, Myung Hoon
    • Journal of the Institute of Electronics and Information Engineers
    • /
    • v.54 no.7
    • /
    • pp.29-35
    • /
    • 2017
  • This paper presents a high-throughput area-efficient mixed-radix fast Fourier transform (FFT) processor using the efficient scheduling schemes. The proposed FFT processor can support 64, 128, 256, and 512-point FFTs for orthogonal frequency division multiplexing (OFDM) systems, and can achieve a high throughput using mixed-radix algorithm and eight-parallel multipath delay commutator (MDC) architecture. This paper proposes new scheduling schemes to reduce the size of read-only memories (ROMs) and complex constant multipliers without increasing delay elements and computation cycles; thus, reducing the hardware complexity further. The proposed mixed-radix MDC FFT processor is designed and implemented using the Samsung 65nm complementary metal-oxide semiconductor (CMOS) technology. The experimental result shows that the area of the proposed FFT processor is 0.36 mm2. Furthermore, the proposed processor can achieve high throughput rates of up to 2.64 GSample/s at 330 MHz.

A Study on Distributed Parallel SWRL Inference in an In-Memory-Based Cluster Environment (인메모리 기반의 클러스터 환경에서 분산 병렬 SWRL 추론에 대한 연구)

  • Lee, Wan-Gon;Bae, Seok-Hyun;Park, Young-Tack
    • Journal of KIISE
    • /
    • v.45 no.3
    • /
    • pp.224-233
    • /
    • 2018
  • Recently, there are many of studies on SWRL reasoning engine based on user-defined rules in a distributed environment using a large-scale ontology. Unlike the schema based axiom rules, efficient inference orders cannot be defined in SWRL rules. There is also a large volumet of network shuffled data produced by unnecessary iterative processes. To solve these problems, in this study, we propose a method that uses Map-Reduce algorithm and distributed in-memory framework to deduce multiple rules simultaneously and minimizes the volume data shuffling occurring between distributed machines in the cluster. For the experiment, we use WiseKB ontology composed of 200 million triples and 36 user-defined rules. We found that the proposed reasoner makes inferences in 16 minutes and is 2.7 times faster than previous reasoning systems that used LUBM benchmark dataset.

Constructing a Support Vector Machine for Localization on a Low-End Cluster Sensor Network (로우엔드 클러스터 센서 네트워크에서 위치 측정을 위한 지지 벡터 머신)

  • Moon, Sangook
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.18 no.12
    • /
    • pp.2885-2890
    • /
    • 2014
  • Localization of a sensor network node using machine learning has been recently studied. It is easy for Support vector machines algorithm to implement in high level language enabling parallelism. Raspberrypi is a linux system which can be used as a sensor node. Pi can be used to construct IP based Hadoop clusters. In this paper, we realized Support vector machine using python language and built a sensor network cluster with 5 Pi's. We also established a Hadoop software framework to employ MapReduce mechanism. In our experiment, we implemented the test sensor network with a variety of parameters and examined based on proficiency, resource evaluation, and processing time. The experimentation showed that with more execution power and memory volume, Pi could be appropriate for a member node of the cluster, accomplishing precise classification for sensor localization using machine learning.

High-Performance Line-Based Filtering Architecture Using Multi-Filter Lifting Method (다중필터 리프팅 방식을 이용한 고성능 라인기반 필터링 구조)

  • 서영호;김동욱
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.41 no.8
    • /
    • pp.75-84
    • /
    • 2004
  • In this paper, we proposed an efficient hardware architecture of line-based lifting algorithm for Motion JPEG2000. We proposed a new architecture of a lifting-based filtering cell which has an optimized and simplified structure. It was implemented in a hardware accommodating both (9,7) and (5,4) filter. Since the output rate is linearly proportional to the input rate, one can obtain the high throughput through parallel operation simply by adding the hardware units. It was implemented into both of ASIC and FPGA The 0.35${\mu}{\textrm}{m}$ CMOS library from Samsung was used for ASIC and Altera was the target for FRGA. In ASIC, the proposed architecture used 41,592 gates for the lifting arithmetic and 128 Kbit memory. For FPGA it used 6,520 LEs(Logic Elements) and 128 ESBs(Embedded System Blocks). The implementations were stably operated in the clock frequency of 128MHz and 52MHz, respectively.

High-Speed Implementations of Block Ciphers on Graphics Processing Units Using CUDA Library (GPU용 연산 라이브러리 CUDA를 이용한 블록암호 고속 구현)

  • Yeom, Yong-Jin;Cho, Yong-Kuk
    • Journal of the Korea Institute of Information Security & Cryptology
    • /
    • v.18 no.3
    • /
    • pp.23-32
    • /
    • 2008
  • The computing power of graphics processing units(GPU) has already surpassed that of CPU and the gap between their powers is getting wider. Thus, research on GPGPU which applies GPU to general purpose becomes popular and shows great success especially in the field of parallel data processing. Since the implementation of cryptographic algorithm using GPU was started by Cook et at. in 2005, improved results using graphic libraries such as OpenGL and DirectX have been published. In this paper, we present skills and results of implementing block ciphers using CUDA library announced by NVIDIA in 2007. Also, we discuss a general method converting source codes of block ciphers on CPU to those on GPU. On NVIDIA 8800GTX GPU, the resulting speeds of block cipher AES, ARIA, and DES are 4.5Gbps, 7.0Gbps, and 2.8Gbps, respectively which are faster than the those on CPU.

Thermodynamics-Based Weight Encoding Methods for Improving Reliability of Biomolecular Perceptrons (생체분자 퍼셉트론의 신뢰성 향상을 위한 열역학 기반 가중치 코딩 방법)

  • Lim, Hee-Woong;Yoo, Suk-I.;Zhang, Byoung-Tak
    • Journal of KIISE:Software and Applications
    • /
    • v.34 no.12
    • /
    • pp.1056-1064
    • /
    • 2007
  • Biomolecular computing is a new computing paradigm that uses biomolecules such as DNA for information representation and processing. The huge number of molecules in a small volume and the innate massive parallelism inspired a novel computation method, and various computation models and molecular algorithms were developed for problem solving. In the meantime, the use of biomolecules for information processing supports the possibility of DNA computing as an application for biological problems. It has the potential as an analysis tool for biochemical information such as gene expression patterns. In this context, a DNA computing-based model of a biomolecular perceptron has been proposed and the result of its experimental implementation was presented previously. The weight encoding and weighted sum operation, which are the main components of a biomolecular perceptron, are based on the competitive hybridization reactions between the input molecules and weight-encoding probe molecules. However, thermodynamic symmetry in the competitive hybridizations is assumed, so there can be some error in the weight representation depending on the probe species in use. Here we suggest a generalized model of hybridization reactions considering the asymmetric thermodynamics in competitive hybridizations and present a weight encoding method for the reliable implementation of a biomolecular perceptron based on this model. We compare the accuracy of our weight encoding method with that of the previous one via computer simulations and present the condition of probe composition to satisfy the error limit.

A New Hardware Design for Generating Digital Holographic Video based on Natural Scene (실사기반 디지털 홀로그래픽 비디오의 실시간 생성을 위한 하드웨어의 설계)

  • Lee, Yoon-Hyuk;Seo, Young-Ho;Kim, Dong-Wook
    • Journal of the Institute of Electronics and Information Engineers
    • /
    • v.49 no.11
    • /
    • pp.86-94
    • /
    • 2012
  • In this paper we propose a hardware architecture of high-speed CGH (computer generated hologram) generation processor, which particularly reduces the number of memory access times to avoid the bottle-neck in the memory access operation. For this, we use three main schemes. The first is pixel-by-pixel calculation rather than light source-by-source calculation. The second is parallel calculation scheme extracted by modifying the previous recursive calculation scheme. The last one is a fully pipelined calculation scheme and exactly structured timing scheduling by adjusting the hardware. The proposed hardware is structured to calculate a row of a CGH in parallel and each hologram pixel in a row is calculated independently. It consists of input interface, initial parameter calculator, hologram pixel calculators, line buffer, and memory controller. The implemented hardware to calculate a row of a $1,920{\times}1,080$ CGH in parallel uses 168,960 LUTs, 153,944 registers, and 19,212 DSP blocks in an Altera FPGA environment. It can stably operate at 198MHz. Because of the three schemes, the time to access the external memory is reduced to about 1/20,000 of the previous ones at the same calculation speed.