Search | Korea Science

Design of a Parallel Rendering Processor Architecture with Effective Memory System (효과적인 메모리 구조를 갖는 병렬 렌더링 프로세서 설계)

Park Woo-Chan;Yoon Duk-Ki;Kim Kyoung-Su
- The KIPS Transactions:PartA / v.13A no.4 s.101 / pp.305-316 / 2006
Current rendering processors are organized mainly to process a single triangle as fast as possible, and parallel 3D rendering processors, which process multiple triangles in parallel with multiple rasterizers, have recently begun to appear. For high triangle-processing performance, it is desirable for each rasterizer to have its own local pixel cache. However, a consistency problem may occur when more than one rasterizer accesses data at the same address simultaneously. In this paper, we propose a parallel rendering processor architecture that resolves this consistency problem effectively. Moreover, the proposed architecture significantly reduces the latency caused by pixel cache misses. To achieve these two goals, effective memory organizations, including a new pixel cache architecture, are presented. The experimental results show that the proposed architecture achieves almost linear speedup in the best case, even with sixteen rasterizers.
https://doi.org/10.3745/KIPSTA.2006.13A.4.305

Efficient pipelined FFT processor for the MIMO-OFDM systems (MIMO-OFDM 시스템을 위한 효율적인 파이프라인 FFT 프로세서의 설계)

Lee, Sang-Min;Jung, Yun-Ho;Kim, Jae-Seok
- The Journal of Korean Institute of Communications and Information Sciences / v.32 no.10C / pp.1025-1031 / 2007
This paper proposes an area-efficient pipelined FFT processor for MIMO-OFDM systems with four transmitting and four receiving antennas. Since a MIMO-OFDM system transmits multiple data streams, the complexity of a MIMO-OFDM system built from single-channel FFT processors increases linearly with the number of transmit channels. The proposed FFT processor is based on a multi-channel structure and can therefore support multiple data streams efficiently. The mixed-radix algorithm decreases the number of non-trivial multiplications in the proposed FFT processor. The proposed FFT processor is synthesized in a $0.18\,{\mu}\mathrm{m}$ CMOS process and reduces the logic gate count by 25% compared with a 4-channel Radix-4 multi-path delay commutator (R4MDC) FFT processor. Since the FFT processor is one of the largest modules in a MIMO-OFDM system, the proposed FFT processor contributes substantially to the low-complexity design of MIMO-OFDM systems.

Join Operation of Parallel Database System with Large Main Memory (대용량 메모리를 가진 병렬 데이터베이스 시스템의 조인 연산)

Park, Young-Kyu
- Journal of the Korea Society of Computer and Information / v.12 no.3 / pp.51-58 / 2007
Because the shared-nothing multiprocessor architecture scales well, it has been adopted in many multiprocessor database systems. However, if the data are not uniformly distributed across the processors, the load becomes unbalanced and the performance of the whole system deteriorates. This is the data skew problem, which usually occurs when processing a parallel hash join. Balancing the load before performing the join resolves this problem efficiently and improves the performance of the whole system. In this paper, we present an algorithm that exploits a very large main memory to reduce the disk access overhead of load balancing and to solve the data skew problem efficiently. We also present an analytical model of the new algorithm and the results of a performance study comparing it with other algorithms for handling data skew.
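To make the data skew problem concrete, the following minimal sketch counts how many tuples each processor receives when join keys are hash-partitioned. It is an illustration only, not the paper's algorithm: the modulo hash, the function names, and the synthetic key sets are all assumptions.

```python
def partition_loads(keys, num_procs):
    """Count how many tuples each processor receives under hash partitioning."""
    loads = [0] * num_procs
    for key in keys:
        # Toy hash (assumption): integer key modulo the processor count.
        loads[key % num_procs] += 1
    return loads


def skew_ratio(loads):
    """Maximum load divided by mean load; 1.0 means perfectly balanced."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean


# Synthetic inputs (assumptions): distinct keys versus one dominant hot key.
uniform_keys = list(range(1000))
skewed_keys = [0] * 700 + list(range(1, 301))

uniform_loads = partition_loads(uniform_keys, 4)
skewed_loads = partition_loads(skewed_keys, 4)

print(uniform_loads, skew_ratio(uniform_loads))
print(skewed_loads, skew_ratio(skewed_loads))
```

With the hot key, one processor holds most of the tuples, so the join finishes only when that processor does. This is the imbalance that load balancing before the join is meant to remove.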

An Optimal Instruction Fetch Strategy for SMT Processors (SMT 프로세서에 최적화된 명령어 페치 전략에 관한 연구)

홍인표;문병인;김문경;이용석
- The Journal of Korean Institute of Communications and Information Sciences / v.27 no.5C / pp.512-521 / 2002
Conventional superscalar RISC processors have recently reached their performance limit, and much research on next-generation architectures has concentrated on SMT (Simultaneous Multi-Threading). In SMT processors, multiple threads execute simultaneously and share hardware resources dynamically. It is therefore more important than ever to supply instructions from multiple threads to the processor core efficiently. Because an SMT architecture achieves higher IPC (Instructions Per Cycle) than a superscalar architecture, its performance is influenced by the fetch bandwidth and the size of the fetch queue. Moreover, to exploit TLP (Thread Level Parallelism) efficiently, the fetch thread selection algorithm and the fetch bandwidth for each selected thread must be designed carefully. In this paper, the performance effects of these factors are analyzed. Based on the results, an optimal instruction fetch strategy for SMT processors is proposed.

A Topology Independent Heuristic Load Balancing Algorithm for Multiprocessor Environment (다중 프로세서 환경에서 연결구조에 무관한 휴리스틱 부하평형 알고리즘)

Song Eui-Seok;Sung Yeong-Rak;Oh Ha-Ryoung
- Journal of the Institute of Electronics Engineers of Korea CI / v.42 no.1 / pp.35-44 / 2005
This paper proposes an efficient heuristic load balancing algorithm for multiprocessor systems. The algorithm minimizes the number of idle links to distribute load traffic and reduces its communication cost. Each processor iteratively tries to transfer a unit load to or from all of its neighbor processors. However, the real load transfer is done collectively after all load traffic has been calculated. This prevents useless traffic and thus reduces the overall load traffic. The proposed algorithm can be employed in various interconnection topologies with slight modifications. In this paper, it is applied to hypercube, mesh, k-ary n-cube and general graph environments. For performance evaluation, simulation studies are performed. The proposed algorithm and well-known existing algorithms are implemented and compared. The results show that the proposed algorithm always balances the loads perfectly. Furthermore, in comparison with the existing algorithms, it reduces the communication costs by 77%, 74% and 73% in the hypercube, the mesh, and the k-ary n-cube, respectively.
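The idea of calculating unit transfers first and moving real load only once can be sketched as follows. This is a simplified illustration on a hypercube, not the authors' algorithm: the stopping rule (neighbor loads differ by at most one), the function names, and the example loads are assumptions, and the sketch guarantees only local balance between neighbors.

```python
def hypercube_neighbors(node, dim):
    """Neighbors of a node in a dim-dimensional hypercube (flip one address bit)."""
    return [node ^ (1 << k) for k in range(dim)]


def plan_transfers(loads, dim):
    """Simulate unit transfers on a copy of the loads and record net flow per link."""
    virtual = list(loads)
    net = {}  # (low, high) -> signed units flowing from low to high
    moved = True
    while moved:
        moved = False
        for src in range(len(virtual)):
            for dst in hypercube_neighbors(src, dim):
                if virtual[src] - virtual[dst] >= 2:
                    virtual[src] -= 1
                    virtual[dst] += 1
                    link = (min(src, dst), max(src, dst))
                    sign = 1 if src < dst else -1
                    # Opposite unit moves on the same link cancel out here.
                    net[link] = net.get(link, 0) + sign
                    moved = True
    return virtual, net


def apply_transfers(loads, net):
    """Perform the real transfers once, using only the net flow on each link."""
    result = list(loads)
    for (low, high), units in net.items():
        result[low] -= units
        result[high] += units
    return result


loads = [8, 0, 0, 0, 0, 0, 0, 0]  # hypothetical initial loads on a 3-cube
planned, net = plan_transfers(loads, 3)
final = apply_transfers(loads, net)
print(final, sum(abs(units) for units in net.values()))
```

Because only the net flow per link is applied, unit moves that would travel back and forth over the same link during planning never become real traffic.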

Adaptive Pipeline Architecture for an Asynchronous Embedded Processor (비동기식 임베디드 프로세서를 위한 적응형 파이프라인 구조)

Lee, Seung-Sook;Lee, Je-Hoon;Lim, Young-Il;Cho, Kyoung-Rok
- Journal of the Institute of Electronics Engineers of Korea SD / v.44 no.1 / pp.51-58 / 2007
This paper presents an adaptive pipeline architecture for a high-performance, low-power asynchronous processor. The proposed pipeline architecture employs a stage-skipping scheme and a stage-combining scheme. The stage-skipping scheme skips the operation of a bubble stage, that is, a pipeline stage that is not used in the execution of an instruction. In the stage-combining scheme, two consecutive stages can be joined to form one stage if the latter stage is empty. The proposed pipeline architecture can reduce both processing time and power consumption. It also supports multi-processing in the EX stage, which executes four instructions in parallel. To estimate the efficiency of the proposed pipeline architecture, we designed an asynchronous microprocessor and synthesized it to a gate-level design using a $0.35\,{\mu}\mathrm{m}$ CMOS standard cell library. We evaluated the performance of the target processor using SPEC2000 benchmark programs. The proposed architecture is about 2.3 times faster than its asynchronous counterpart, AMULET3i. As a result, the proposed pipeline schemes and architecture can be used for asynchronous high-speed processor design.
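The effect of stage skipping on per-instruction latency can be illustrated with a toy model. The stage names, the delays, and the table of which stages each instruction class uses are hypothetical values chosen for illustration; they are not taken from the paper, and the model ignores stage combining and overlap between instructions.

```python
# Hypothetical per-stage delays in arbitrary time units (assumption).
STAGE_DELAY = {"IF": 2, "ID": 1, "EX": 3, "MEM": 4, "WB": 1}

# Hypothetical stage usage per instruction class (assumption).
USED_STAGES = {
    "add": {"IF", "ID", "EX", "WB"},
    "load": {"IF", "ID", "EX", "MEM", "WB"},
    "branch": {"IF", "ID", "EX"},
}


def latency(op, skip_bubbles):
    """Sum stage delays; with skipping, stages the instruction does not use cost nothing."""
    total = 0
    for stage, delay in STAGE_DELAY.items():
        if skip_bubbles and stage not in USED_STAGES[op]:
            continue  # bubble stage: its operation is skipped
        total += delay
    return total


for op in USED_STAGES:
    print(op, latency(op, skip_bubbles=False), latency(op, skip_bubbles=True))
```

In this model a load gains nothing because it uses every stage, while an add or a branch avoids the delay of the stages it leaves empty.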

A Dedicated Bus System for Cache Coherence (캐시 일관성 유지를 위한 전용 버스 시스템)

천희식;김우완
- Proceedings of the Korean Information Science Society Conference / 1998.10a / pp.30-32 / 1998
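Multiprocessor system design is based on one of two paradigms: shared memory or message passing. Shared-memory multiprocessors simplify data partitioning and dynamic load balancing and readily support scalability, but when each processor has its own private cache, a consistency problem arises between main memory and the copies of data held in those private caches. Among the approaches proposed for maintaining coherence, this paper targets the protocol of DASH, a shared-memory multiprocessor prototype composed of processing nodes and a high-bandwidth, low-latency interconnection network. We design a dedicated bus system that supports the DASH protocol in conformance with the fully open IEEE Futurebus+ standard, and then show by simulation that this system can carry out the actions needed to maintain cache coherence under the DASH protocol in parallel with the actions performed by the existing general-purpose bus system.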

An Analysis and Simulation of sRIO for Implementation of Robot's Hetero-Multi Processor (로봇의 이기종 다중 프로세서 구현을 위한 Serial RapidIO(sRIO) 분석 및 시뮬레이션)

Moon, Yong-Seomn;Roh, Sang-Hyun;Jo, Kwang-Hun;Park, Jong-Kyu;Bae, Young-Chul
- Journal of Advanced Navigation Technology / v.14 no.1 / pp.57-65 / 2010
In this paper, we propose the concept of a heterogeneous multiprocessor structure as a new type of robot controller, and we introduce a method of integrating the distributed multiprocessors within the controller using sRIO. As a way to implement the integrated heterogeneous multiprocessor over sRIO communication, we also perform a computer simulation using an sRIO IP core designed in an FPGA, and we verify the result.

Processor-Architecture for the Faster Processing of Genetic Algorithm (유전 알고리듬 처리속도 향상을 위한 프로세서 구조)

윤한얼;정재원;심귀보
- Proceedings of the Korean Institute of Intelligent Systems Conference / 2004.10a / pp.169-172 / 2004
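Genetic algorithms are used across a wide range of fields, including solving NP-hard problems, function optimization, and tracking the parameter values of complex controllers. A typical genetic algorithm determines the quality of solutions with a fitness function, applies a selection operation according to that quality, and searches for high-quality solutions through crossover and mutation. At present this process is mostly implemented in software and executed on general-purpose processors. However, heavy reliance on software has the weakness that the time spent on crossover and mutation operations and on comparing solution quality grows sharply as the population size increases. Therefore, under the constraint that rank-based selection and one-point crossover are used, this paper proposes a processor architecture that determines the ranks of solutions with a sorting network and represents solutions in a Residue Number System (RNS) so that the crossover operation is processed in hardware. This approach greatly reduces the time needed to compare solution quality and improves the efficiency of the crossover and mutation operations.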
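The two building blocks named in this abstract, ranking with a sorting network and one-point crossover, can be sketched in software as follows. The choice of an odd-even transposition network is an assumption (the abstract does not say which sorting network is used), the function names are hypothetical, and the sketch does not model the RNS representation.

```python
def odd_even_transposition_sort(values):
    """Sort in descending order with a fixed compare-exchange schedule.

    The schedule does not depend on the data, which is what makes a sorting
    network suitable for hardware: n rounds alternate between even and odd
    adjacent pairs.
    """
    data = list(values)
    n = len(data)
    for round_no in range(n):
        start = round_no % 2
        for i in range(start, n - 1, 2):
            if data[i] < data[i + 1]:
                data[i], data[i + 1] = data[i + 1], data[i]
    return data


def one_point_crossover(parent_a, parent_b, point):
    """Swap the tails of two equal-length chromosomes at one cut point."""
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


fitness = [3, 7, 1, 9, 4]  # hypothetical fitness values
ranked = odd_even_transposition_sort(fitness)
children = one_point_crossover("110011", "000000", 2)
print(ranked, children)
```

In hardware, the compare-exchange operations within one round are independent and can run in parallel, so the ranking takes a number of rounds proportional to the population size rather than a data-dependent number of comparisons.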

Design of a RISC Processor with an Efficient Processing Unit for Multimedia Data (효율적인 멀티미디어데이터 처리를 위한 RISC Processor의 설계)

조태헌;남기훈;김명환;이광엽
- Proceedings of the IEEK Conference / 2003.07b / pp.867-870 / 2003
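This paper aims to design an efficient RISC processor unit for multimedia data processing. Based on the SIMD (Single Instruction Multiple Data) concept of vector processors, it improves the performance of the multiply-and-accumulate (MAC) operation, which is fundamental to multimedia data processing, through partial parallelization of operations on data whose bit width is small relative to the fixed bit width of the arithmetic unit. In addition, in general-purpose processor extensions such as MMX and VIS, partial parallelization requires contiguous data as a precondition, so data of different lengths or multimedia data with small bit widths must be reprocessed into a single datum by a realignment (packing/unpacking) step, which degrades overall performance. This paper therefore reuses the arithmetic unit structure of an existing processor to implement an arithmetic unit structure for parallel multiplication and proposes a data alignment operation structure for it.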
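Partial (sub-word) parallelism for the MAC operation can be illustrated by packing four unsigned 8-bit values into one 32-bit word and accumulating the lane-wise products. This is a software model under assumed lane widths and hypothetical function names; it does not describe the datapath proposed in the paper.

```python
def pack4(values):
    """Pack four unsigned 8-bit values into one 32-bit word, lane 0 in the low byte."""
    if len(values) != 4:
        raise ValueError("expected exactly four values")
    word = 0
    for lane, value in enumerate(values):
        if not 0 <= value <= 0xFF:
            raise ValueError("value does not fit in 8 bits")
        word |= value << (8 * lane)
    return word


def unpack4(word):
    """Split a 32-bit word back into its four 8-bit lanes."""
    return [(word >> (8 * lane)) & 0xFF for lane in range(4)]


def mac4(acc, word_a, word_b):
    """Multiply corresponding lanes and add all four products to the accumulator."""
    products = [a * b for a, b in zip(unpack4(word_a), unpack4(word_b))]
    return acc + sum(products)


word_a = pack4([1, 2, 3, 4])
word_b = pack4([5, 6, 7, 8])
result = mac4(10, word_a, word_b)
print(hex(word_a), result)
```

The `pack4` and `unpack4` steps correspond to the packing/unpacking overhead that the abstract identifies as a performance cost when it must be done as separate instructions.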

Search Result 1,040, Processing Time 0.024 seconds