통합 검색 | Korea Science

Software Pipeline-Based Partitioning Method with Trade-Off between Workload Balance and Communication Optimization

Huang, Kai;Xiu, Siwen;Yu, Min;Zhang, Xiaomeng;Yan, Rongjie;Yan, Xiaolang;Liu, Zhili
- ETRI Journal
- /
- 제37권3호
- /
- pp.562-572
- /
- 2015
For a multiprocessor System-on-Chip (MPSoC) to achieve high performance via parallelism, we must consider how to partition a given application into different components and map the components onto multiple processors. In this paper, we propose a software pipeline-based partitioning method with cyclic dependent task management and communication optimization. During task partitioning, simultaneously considering computation load balance and communication optimization can cause interference, which leads to performance loss. To address this issue, we formulate their constraints and apply an integer linear programming approach to find an optimal partitioning result - one that requires a trade-off between these two factors. Experimental results on a reconfigurable MPSoC platform demonstrate the effectiveness of the proposed method, with 20% to 40% performance improvements compared to a traditional software pipeline-based partitioning method.
https://doi.org/10.4218/etrij.15.0114.0502 인용 PDF KSCI

동적 재구성이 가능한 SoC 3중 버스 구조 (Dynamically Reconfigurable SoC 3-Layer Bus Structure)

김규철;서병현
- 전기전자학회논문지
- /
- 제13권2호
- /
- pp.101-107
- /
- 2009
집적회로의 공정기술 및 설계기술이 발전함에 따라 많은 IP가 하나의 반도체 칩에 집적되어 하나의 시스템을 구성하는 SoC 설계가 많이 이루어지고 있다. 본 논문에서는 다양한 IP 간에 효율적인 데이터 통신이 이루어지도록 버스 상의 전송 특성에 따라 버스모드를 동적으로 재구성하는 SoC 3중 버스 구조를 제안한다. 제안된 버스는 다중-단일버스 모드, 단일-다중버스 모드로 재구성이 가능하며 따라서 단일버스 모드와 다중버스 모드의 장점을 모두 갖는다. 실험결과 제안된 버스구조는 기존의 고정된 버스구조보다 독립적이며 데이터 전송시간을 단축시킬 수 있음을 확인하였다. 그리고 제안된 버스구조를 JPEG 시스템에 적용한 결과 다중버스구조보다 평균 22%의 전송시간 단축을 얻을 수 있었다.
PDF

통신 시스템을 위한 고성능 재구성 가능 코프로세서의 설계 (Novel Reconfigurable Coprocessor for Communication Systems)

정철윤;선우명훈
- 대한전자공학회논문지SD
- /
- 제42권6호
- /
- pp.39-48
- /
- 2005
본 논문은 통신 시스템에서 요구하는 다양한 연산과 고속의 동작을 수행할 수 있는 재구성 가능 코프로세서를 제안하였다. 제안된 재구성 가능 코프로세서는 스크램블링, 인터리빙, 길쌈부호화, 비터비 디코딩, FFT 등과 같은 통신 시스템에 필수적인 연산 동작을 쉽게 구현할 수 있는 특징을 가진다. 제안된 재구성 가능 코프로세서는 VHDL로 설계하여 SEC 0.18$\mu$m 표준셀 라이브러리를 이용해 합성하였으며, 총 35,000 게이트에 3.84ns의 최대 동작 속도를 보였다. 제안된 코프로세서에 대한 성능검증 결과 IEEE 802.11a WLAN 표준에 대해 기존 DSP에 비해서 FFT 연산과 Complex MAC의 경우 약 $33\%$, 비터비 디코딩의 경우 약 $37\%$, 스크램블링 및 길쌈부호화의 경우 약 $48\%\~84\%$의 연산 사이클 감소를 확인하였으며 다양한 통신 알고리즘에 대해 기존 DSP보다 우수한 성능을 나타내었다.
PDF KSCI

Low-Power-Adaptive MC-CDMA Receiver Architecture

Hasan, Mohd.;Arslan, Tughrul;Thompson, John S.
- ETRI Journal
- /
- 제29권1호
- /
- pp.79-88
- /
- 2007
This paper proposes a novel concept of adjusting the hardware size in a multi-carrier code division multiple access (MC-CDMA) receiver in real time as per the channel parameters such as delay spread, signal-to-noise ratio, transmission rate, and Doppler frequency. The fast Fourier transform (FFT) or inverse FFT (IFFT) size in orthogonal frequency division multiplexing (OFDM)/MC-CDMA transceivers varies from 1024 points to 16 points. Two low-power reconfigurable radix-4 256-point FFT processor architectures are proposed that can also be dynamically configured as 64-point and 16-point as per the channel parameters to prove the concept. By tailoring the clock of the higher FFT stages for longer FFTs and switching to shorter FFTs from longer FFTs, significant power saving is achieved. In addition, two 256 sub-carrier MC-CDMA receiver architectures are proposed which can also be configured for 64 sub-carriers in real time to prove the feasibility of the concept over the whole receiver.
PDF

Median 필터를 위한 RMESH 병렬 알고리즘의 설계 (Design of RMESH Parallel Algorithms for Median Filters)

전병문;정창성
- 한국정보처리학회논문지
- /
- 제5권11호
- /
- pp.2845-2854
- /
- 1998
Median 필터는 임계치 분할, 스택킹 특성, 그리고 선형 분리성에 기반하여 이진 영역에서 구현이 가능하다. 본 논문에서는 VLSI 구현에 적합한 변형 가능한 메쉬(RMESH) 구조에서 median 필터링을 위한 1차원 및 2차운 병렬 알고리즘의 성능을 평가한다. 실제로 M 레벨의 1차원 시그널 길이가 N이고 윈도우 폭이 W일 때, 메쉬 구조에서는 $O(Mw^2)$의 시간 복잡도를 갖는 반면 RMESH 구조에서의 알고리즘은 O(Mw) 시간 복잡도를 갖는다. 또한 M 레벨의 2차원 영상의 크기가 $N{\times}N$이고 원도우 크기가 $w{\times}w$라고 가정하면, 본 논문에서 제안한 $N{\times}N$ RMESH 상에서 median 필터링 알고리즘은 $N{\times}N$ 메쉬의 $O(Mw^2)$ 시간 보다 더욱 향상된 O(Mw) 시간에 계산되어진다.
PDF

Evaluation of a Self-Adaptive Voltage Control Scheme for Low-Power FPGAs

Ishihara, Shota;Xia, Zhengfan;Hariyama, Masanori;Kameyama, Michitaka
- JSTS:Journal of Semiconductor Technology and Science
- /
- 제10권3호
- /
- pp.165-175
- /
- 2010
This paper presents a fine-grain supply-voltage-control scheme for low-power FPGAs. The proposed supply-voltage-control scheme detects the critical path in real time with small overheads by exploiting features of asynchronous architectures. In an FPGA based on the proposed supply-voltage-control scheme, logic blocks on the sub-critical path are autonomously switched to a lower supply voltage to reduce the power consumption without system performance degradation. Moreover, in order to reduce the overheads of level shifters used at the power domain interface, a look-up-table without level shifters is employed. Because of the small overheads of the proposed supply-voltage-control scheme and the power domain interface, the granularity size of the power domain in the proposed FPGA is as fine as a single four-input logic block. The proposed FPGA is fabricated using the e-Shuttle 65 nm CMOS process. Correct operation of the proposed FPGA on the test chip is confirmed.
https://doi.org/10.5573/JSTS.2010.10.3.165 인용 PDF KSCI

타원곡선 암호프로세서의 재구성형 하드웨어 구현을 위한 GF(2$^{m}$)상의 새로운 연산기 (A Novel Arithmetic Unit Over GF(2$^{m}$) for Reconfigurable Hardware Implementation of the Elliptic Curve Cryptographic Processor)

김창훈;권순학;홍춘표;유기영
- 한국정보과학회논문지:시스템및이론
- /
- 제31권8호
- /
- pp.453-464
- /
- 2004
In order to solve the well-known drawback of reduced flexibility that is associate with ASIC implementations, this paper proposes a novel arithmetic unit over GF(2$^{m}$ ) for field programmable gate arrays (FPGAs) implementations of elliptic curve cryptographic processor. The proposed arithmetic unit is based on the binary extended GCD algorithm and the MSB-first multiplication scheme, and designed as systolic architecture to remove global signals broadcasting. The proposed architecture can perform both division and multiplication in GF(2$^{m}$ ). In other word, when input data come in continuously, it produces division results at a rate of one per m clock cycles after an initial delay of 5m-2 in division mode and multiplication results at a rate of one per m clock cycles after an initial delay of 3m in multiplication mode respectively. Analysis shows that while previously proposed dividers have area complexity of Ο(m$^2$) or Ο(mㆍ(log$_2$$^{m}$ )), the Proposed architecture has area complexity of Ο(m), In addition, the proposed architecture has significantly less computational delay time compared with the divider which has area complexity of Ο(mㆍ(log$_2$$^{m}$ )). FPGA implementation results of the proposed arithmetic unit, in which Altera's EP2A70F1508C-7 was used as the target device, show that it ran at maximum 121MHz and utilized 52% of the chip area in GF(2$^{571}$ ). Therefore, when elliptic curve cryptographic processor is implemented on FPGAs, the proposed arithmetic unit is well suited for both division and multiplication circuit.
PDF KSCI

Hardware Approach to Fuzzy Inference―ASIC and RISC―

Watanabe, Hiroyuki
- 한국지능시스템학회:학술대회논문집
- /
- 한국퍼지및지능시스템학회 1993년도 Fifth International Fuzzy Systems Association World Congress 93
- /
- pp.975-976
- /
- 1993
This talk presents the overview of the author's research and development activities on fuzzy inference hardware. We involved it with two distinct approaches. The first approach is to use application specific integrated circuits (ASIC) technology. The fuzzy inference method is directly implemented in silicon. The second approach, which is in its preliminary stage, is to use more conventional microprocessor architecture. Here, we use a quantitative technique used by designer of reduced instruction set computer (RISC) to modify an architecture of a microprocessor. In the ASIC approach, we implemented the most widely used fuzzy inference mechanism directly on silicon. The mechanism is beaded on a max-min compositional rule of inference, and Mandami's method of fuzzy implication. The two VLSI fuzzy inference chips are designed, fabricated, and fully tested. Both used a full-custom CMOS technology. The second and more claborate chip was designed at the University of North Carolina(U C) in cooperation with MCNC. Both VLSI chips had muliple datapaths for rule digital fuzzy inference chips had multiple datapaths for rule evaluation, and they executed multiple fuzzy if-then rules in parallel. The AT & T chip is the first digital fuzzy inference chip in the world. It ran with a 20 MHz clock cycle and achieved an approximately 80.000 Fuzzy Logical inferences Per Second (FLIPS). It stored and executed 16 fuzzy if-then rules. Since it was designed as a proof of concept prototype chip, it had minimal amount of peripheral logic for system integration. UNC/MCNC chip consists of 688,131 transistors of which 476,160 are used for RAM memory. It ran with a 10 MHz clock cycle. The chip has a 3-staged pipeline and initiates a computation of new inference every 64 cycle. This chip achieved an approximately 160,000 FLIPS. The new architecture have the following important improvements from the AT & T chip: Programmable rule set memory (RAM). On-chip fuzzification operation by a table lookup method. On-chip defuzzification operation by a centroid method. Reconfigurable architecture for processing two rule formats. RAM/datapath redundancy for higher yield It can store and execute 51 if-then rule of the following format: IF A and B and C and D Then Do E, and Then Do F. With this format, the chip takes four inputs and produces two outputs. By software reconfiguration, it can store and execute 102 if-then rules of the following simpler format using the same datapath: IF A and B Then Do E. With this format the chip takes two inputs and produces one outputs. We have built two VME-bus board systems based on this chip for Oak Ridge National Laboratory (ORNL). The board is now installed in a robot at ORNL. Researchers uses this board for experiment in autonomous robot navigation. The Fuzzy Logic system board places the Fuzzy chip into a VMEbus environment. High level C language functions hide the operational details of the board from the applications programme . The programmer treats rule memories and fuzzification function memories as local structures passed as parameters to the C functions. ASIC fuzzy inference hardware is extremely fast, but they are limited in generality. Many aspects of the design are limited or fixed. We have proposed to designing a are limited or fixed. We have proposed to designing a fuzzy information processor as an application specific processor using a quantitative approach. The quantitative approach was developed by RISC designers. In effect, we are interested in evaluating the effectiveness of a specialized RISC processor for fuzzy information processing. As the first step, we measured the possible speed-up of a fuzzy inference program based on if-then rules by an introduction of specialized instructions, i.e., min and max instructions. The minimum and maximum operations are heavily used in fuzzy logic applications as fuzzy intersection and union. We performed measurements using a MIPS R3000 as a base micropro essor. The initial result is encouraging. We can achieve as high as a 2.5 increase in inference speed if the R3000 had min and max instructions. Also, they are useful for speeding up other fuzzy operations such as bounded product and bounded sum. The embedded processor's main task is to control some device or process. It usually runs a single or a embedded processer to create an embedded processor for fuzzy control is very effective. Table I shows the measured speed of the inference by a MIPS R3000 microprocessor, a fictitious MIPS R3000 microprocessor with min and max instructions, and a UNC/MCNC ASIC fuzzy inference chip. The software that used on microprocessors is a simulator of the ASIC chip. The first row is the computation time in seconds of 6000 inferences using 51 rules where each fuzzy set is represented by an array of 64 elements. The second row is the time required to perform a single inference. The last row is the fuzzy logical inferences per second (FLIPS) measured for ach device. There is a large gap in run time between the ASIC and software approaches even if we resort to a specialized fuzzy microprocessor. As for design time and cost, these two approaches represent two extremes. An ASIC approach is extremely expensive. It is, therefore, an important research topic to design a specialized computing architecture for fuzzy applications that falls between these two extremes both in run time and design time/cost. TABLEI INFERENCE TIME BY 51 RULES {{{{Time }}{{MIPS R3000 }}{{ASIC }}{{Regular }}{{With min/mix }}{{6000 inference 1 inference FLIPS }}{{125s 20.8ms 48 }}{{49s 8.2ms 122 }}{{0.0038s 6.4㎲ 156,250 }} }}
PDF

검색결과 8건 처리시간 0.02초

이메일무단수집거부

이용약관

제 1 장 총칙

제 2 장 이용계약의 체결

제 3 장 계약 당사자의 의무

제 4 장 서비스의 이용

제 5 장 계약 해지 및 이용 제한

제 6 장 손해배상 및 기타사항

자세히 찾기

이미지 검색 (β)