• Title/Summary/Keyword: Data parallel architecture

Search Result 292, Processing Time 0.02 seconds

David II: A new architecture for parallel rendering processors with effective memory system (David II: 효과적인 메모리 시스템을 가지는 병렬 렌더링 프로세서)

  • Lee, Kil-Whan;Park, Woo-Chan;Kim, Il-San;Han, Tack-Don
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2004.05a
    • /
    • pp.1655-1658
    • /
    • 2004
  • Current rendering processors are organized mainly to process a triangle as fast as possible and recently parallel 3D rendering processors, which can process multiple triangles in parallel with multiple rasterizers, begin to appear. For high performance in processing triangles, it is desirable for each rasterizer have its own local pixel cache. However, the consistency problem may occur in accessing the data at the same address simultaneously by more than one rasterizer. In this paper, we propose a parallel rendering processor architecture, called DAVID II, resolving such consistency problem effectively. Moreover, the proposed architecture reduces the latency due to a pixel cache miss significantly. The experimental results show that DAVID II achieves almost linear speedup at best case even in sixteen rasterizers.

  • PDF

A Study on Efficient Executions of MPI Parallel Programs in Memory-Centric Computer Architecture

  • Lee, Je-Man;Lee, Seung-Chul;Shin, Dongha
    • Journal of the Korea Society of Computer and Information
    • /
    • v.25 no.1
    • /
    • pp.1-11
    • /
    • 2020
  • In this paper, we present a technique that executes MPI parallel programs, that are developed on processor-centric computer architecture, more efficiently on memory-centric computer architecture without program modification. The technique we present here improves performance by replacing low-speed data communication over the network of MPI library functions with high-speed data communication using the property called fast large shared memory of memory-centric computer architecture. The technique we present in the paper is implemented in two programs. The first program is a modified MPI library called MC-MPI-LIB that runs MPI parallel programs more efficiently on memory-centric computer architecture preserving the semantics of MPI library functions. The second program is a simulation program called MC-MPI-SIM that simulates the performance of memory-centric computer architecture on processor-centric computer architecture. We developed and tested the programs on distributed systems environment deployed on Docker based virtualization. We analyzed the performance of several MPI parallel programs and showed that we achieved better performance on memory-centric computer architecture. Especially we could see very high performance on the MPI parallel programs with high communication overhead.

An FPGA-based Parallel Hardware Architecture for Real-time Eye Detection

  • Kim, Dong-Kyun;Jung, Jun-Hee;Nguyen, Thuy Tuong;Kim, Dai-Jin;Kim, Mun-Sang;Kwon, Key-Ho;Jeon, Jae-Wook
    • JSTS:Journal of Semiconductor Technology and Science
    • /
    • v.12 no.2
    • /
    • pp.150-161
    • /
    • 2012
  • Eye detection is widely used in applications, such as face recognition, driver behavior analysis, and human-computer interaction. However, it is difficult to achieve real-time performance with software-based eye detection in an embedded environment. In this paper, we propose a parallel hardware architecture for real-time eye detection. We use the AdaBoost algorithm with modified census transform(MCT) to detect eyes on a face image. We parallelize part of the algorithm to speed up processing. Several downscaled pyramid images of the eye candidate region are generated in parallel using the input face image. We can detect the left and the right eye simultaneously using these downscaled images. The sequential data processing bottleneck caused by repetitive operation is removed by employing a pipelined parallel architecture. The proposed architecture is designed using Verilog HDL and implemented on a Virtex-5 FPGA for prototyping and evaluation. The proposed system can detect eyes within 0.15 ms in a VGA image.

Architecture design for speeding up Multi-Access Memory System(MAMS) (Multi-Access Memory System(MAMS)의 속도 향상을 위한 아키텍처 설계)

  • Ko, Kyung-sik;Kim, Jae Hee;Lee, S-Ra-El;Park, Jong Won
    • Journal of the Institute of Electronics and Information Engineers
    • /
    • v.54 no.6
    • /
    • pp.55-64
    • /
    • 2017
  • High-capacity, high-definition image applications need to process considerable amounts of data at high speed. Accordingly, users of these applications demand a high-speed parallel execution system. To increase the speed of a parallel execution system, Park (2004) proposed a technique, called MAMS (Multi-Access Memory System), to access data in several execution units without the conflict of parallel processing memories. Since then, many studies on MAMS have been conducted, furthering the technique to MAMS-PP16 and MAMS-PP64, among others. As a memory architecture for parallel processing, MAMS must be constructed in one chip; therefore, a method to achieve the identical functionality as the existing MAMS while minimizing the architecture needs to be studied. This study proposes a method of miniaturizing the MAMS architecture in which the architectures of the ACR (Address Calculation and Routing) circuit and MMS (Memory Module Selection) circuit, which deliver data in memories to parallel execution units (PEs), do not use the MMS circuit, but are constructed as one shift and conditional statements whose number is the same as that of memory modules inside the ACR circuit. To verify the performance of the realized architecture, the study conducted the processing time of the proposed MAMS-PP64 through an image correlation test, the results of which demonstrated that the ratio of the image correlation from the proposed architecture was improved by 1.05 on average.

Three-Parallel Reed-Solomon based Forward Error Correction Architecture for 100Gb/s Optical Communications (100Gb/s급 광통신시스템을 위한 3-병렬 Reed-Solomon 기반 FEC 구조 설계)

  • Choi, Chang-Seok;Lee, Han-Ho
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.46 no.11
    • /
    • pp.48-55
    • /
    • 2009
  • This paper presents a high-speed Forward Error Correction (FEC) architecture based on three-parallel Reed-Solomon (RS) decoder for next-generation 100-Gb/s optical communication systems. A high-speed three-parallel RS(255,239) decoder has been designed and the derived structure can also be applied to implement the 100-Gb/s RS-FEC architecture. The proposed 100-Gb/s RS-FEC has been implemented with 0.13-${\mu}m$ CMOS standard cell technology in a supply voltage of 1.2V. The implementation results show that 16-Ch. RS-FEC architecture can operate at a clock frequency of 300MHz and has a throughput of 115-Gb/s for 0.13-${\mu}m$ CMOS technology. As a result, the proposed three-parallel RS-FEC architecture has a much higher data processing rate and low hardware complexity compared with the conventional two-parallel, three-parallel and serial RS-FEC architectures.

Adaptive and optimized agent placement scheme for parallel agent-based simulation

  • Jin, Ki-Sung;Lee, Sang-Min;Kim, Young-Chul
    • ETRI Journal
    • /
    • v.44 no.2
    • /
    • pp.313-326
    • /
    • 2022
  • This study presents a noble scheme for distributed and parallel simulations with optimized agent placement for simulation instances. The traditional parallel simulation has some limitations in that it does not provide sufficient performance even though using multiple resources. The main reason for this discrepancy is that supporting parallelism inevitably requires additional costs in addition to the base simulation cost. We present a comprehensive study of parallel simulation architectures, execution flows, and characteristics. Then, we identify critical challenges for optimizing large simulations for parallel instances. Based on our cost-benefit analysis, we propose a novel approach to overcome the performance constraints of agent-based parallel simulations. We also propose a solution for eliminating the synchronizing cost among local instances. Our method ensures balanced performance through optimal deployment of agents to local instances and an adaptive agent placement scheme according to the simulation load. Additionally, our empirical evaluation reveals that the proposed model achieves better performance than conventional methods under several conditions.

A study on the Cost-effective Architecture Design of High-speed Soft-decision Viterbi Decoder for Multi-band OFDM Systems (Multi-band OFDM 시스템용 고속 연판정 비터비 디코더의 효율적인 하드웨어 구조 설계에 관한 연구)

  • Lee, Seong-Joo
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.43 no.11 s.353
    • /
    • pp.90-97
    • /
    • 2006
  • In this paper, we present a cost-effective architecture of high-speed soft-decision Viterbi decoder for Multi-band OFDM(MB-OFDM) systems. In the design of modem for MB-OFDM systems, a parallel processing architecture is general]y used for the reliable hardware implementation, because the systems should support a very high-speed data rate of at most 480Mbps. A Viterbi decoder also should be designed by using a parallel processing structure and support a very high-speed data rate. Therefore, we present a optimized hardware architecture for 4-way parallel processing Viterbi decoder in this paper. In order to optimize the hardware of Viterbi decoder, we compare and analyze various ACS architectures and find the optimal one among them with respect to hardware complexity and operating frequency The Viterbi decoder with a optimal hardware architecture is designed and verified by using Verilog HDL, and synthesized into gate-level circuits with TSMC 0.13um library. In the synthesis results, we find that the Viterbi decoder contains about 280K gates and works properly at the speed required in MB-OFDM systems.

Correct Implementation of Sub-warp Parallel Prefix Operations based on GPU Hardware Architecture (GPU 하드웨어 아키텍처 기반 sub-warp 단위 병렬 프리픽스(prefix) 연산의 정확한 구현)

  • Park, Taejung
    • Journal of Digital Contents Society
    • /
    • v.18 no.3
    • /
    • pp.613-619
    • /
    • 2017
  • This paper presents a CUDA (Compute Unified Device Architecture) code to achieve correct GPU parallel segmented prefix operation results with less than 32 segment length for large data arrays. Mark Harris and Michael Garland had published CUDA code to address the tasks. This paper shows that their code does not generate correct results when the local segment length is less than 32, discusses the cause of the problem, and presents a CUDA code that generates correct results. The segmented parallel prefix operation presented in this paper can be applied as a building block to various large parallel processing algorithms including the k-nearest neighbor search problems.

A Functional Design of Programmable Logic Controller Based on Parallel Architecture (병렬 구조에 의한 가변 논리제어장치의 기능적 설계)

  • 이정훈;신현식
    • The Transactions of the Korean Institute of Electrical Engineers
    • /
    • v.40 no.8
    • /
    • pp.836-844
    • /
    • 1991
  • PLC(programmable logic controller) system is widely used for the control of factory. PLC system receives ladder diagram which is drawn by the user to implement hardware logic, converts the ladder diagram into sequence program which is executable in the PLC system, and executes the sequence program indefinitely unless user breaks. The sequence program processes the data of on/off signal, and endures 1 scan delay and missing of pulse-type signal shorter than a scan time. So, data dependency doesn't exist. By applying theis characteristics to multiprocessor architecture, we design parellel PLC functionally and evaluate performance upgrade. Parallel PLC consists of central processing module, N general processing unit, and a shared memory by master-slave type. Each module executes allocated sequence program by the control of central processing module. We can expect performance upgrade by parallel processing, and reliability by relocation of sequence program when error occurs in processing module.

  • PDF

Discrete Cosine Transform Algorithms for the VLSI Parallel Implementation (VLSI 병렬 연산을 위한 여현 변환 알고리듬)

  • 조남익;이상욱
    • Journal of the Korean Institute of Telematics and Electronics
    • /
    • v.25 no.7
    • /
    • pp.851-858
    • /
    • 1988
  • In this paper, we propose two different VLSI architectures for the parallel computation of DCT (discrete cosine transform) algorithm. First, it is shown that the DCT algorithm can be implemented on the existing systolic architecture for the DFT(discrete fourier transform) by introducing some modification. Secondly, a new prime factor DCT algorithm based on the prime factor DFT algorithm is proposed. And it is shown that the proposed algorihtm can be implemented in parallel on the systolic architecture for the prime factor DFT. However, proposed algorithm is only applicable to the data length which can be decomposed into relatively prime and odd numbers. It is also found that the proposed systolic architecture requires less multipliers than the structures implementing FDCT(fast DCT) algorithms directly.

  • PDF