Search | Korea Science

Software Pipeline-Based Partitioning Method with Trade-Off between Workload Balance and Communication Optimization

Huang, Kai;Xiu, Siwen;Yu, Min;Zhang, Xiaomeng;Yan, Rongjie;Yan, Xiaolang;Liu, Zhili
- ETRI Journal
- /
- v.37 no.3
- /
- pp.562-572
- /
- 2015
For a multiprocessor System-on-Chip (MPSoC) to achieve high performance via parallelism, we must consider how to partition a given application into different components and map the components onto multiple processors. In this paper, we propose a software pipeline-based partitioning method with cyclic dependent task management and communication optimization. During task partitioning, simultaneously considering computation load balance and communication optimization can cause interference, which leads to performance loss. To address this issue, we formulate their constraints and apply an integer linear programming approach to find an optimal partitioning result - one that requires a trade-off between these two factors. Experimental results on a reconfigurable MPSoC platform demonstrate the effectiveness of the proposed method, with 20% to 40% performance improvements compared to a traditional software pipeline-based partitioning method.
https://doi.org/10.4218/etrij.15.0114.0502 인용 PDF KSCI

Dynamically Reconfigurable SoC 3-Layer Bus Structure (동적 재구성이 가능한 SoC 3중 버스 구조)

Kim, Kyu-Chull;Seo, Byung-Hyun
- Journal of IKEEE
- /
- v.13 no.2
- /
- pp.101-107
- /
- 2009
Growth in the VLSI process and design technology is resulting into a continuous increase in the number of IPs on a chip to form a system. Because of many IPs on a single chip, efficient communication between IPs is essential. We propose a dynamically reconfigurable 3-layer bus structure which can adapt to the pattern of data transmission to achieve an efficient data communication between various IPs. The proposed 3-layer bus can be reconfigured to multi-single bus mode, and single-multi bus mode, thus providing the benefits of both single-bus and multi-bus modes. Experimental results show that the flexibility of the proposed bus structure can reduce data transmission time compared to the conventional fixed bus structure. We incorporated the proposed bus structure in a JPEG system and verified that the proposed structure achieved an average of 22% improvement in time over the conventional fixed bus structure.
PDF

Novel Reconfigurable Coprocessor for Communication Systems (통신 시스템을 위한 고성능 재구성 가능 코프로세서의 설계)

Jung Chul Yoon;Sunwoo Myung Hoon
- Journal of the Institute of Electronics Engineers of Korea SD
- /
- v.42 no.6 s.336
- /
- pp.39-48
- /
- 2005
This paper proposes a reconfigurable coprocessor for communication systems, which can perform high speed computations and various functions. The proposed reconfigurable coprocessor can easily implement communication operations, such as scrambling, interleaving, convolutional encoding, Viterbi decoding, FFT, etc. The proposed architecture has been modeled by VHDL and synthesized using the SEC 0.18$\mu$m standard cell library. The gate count is about 35,000 gates and the critical path is 3.84ns. The proposed coprocessor can reduced about $33\%$ for FFT operations and complex MAC, $37\%$ for Viterbi operations, and $48\%\~84\%$ for scrambling and convolutional encoding for the IEEE 802.11a WLAN standard compared with existing DSPs. The proposed coprocessor shows Performance improvements compared with existing DSP chips for communication algorithms.
PDF KSCI

Low-Power-Adaptive MC-CDMA Receiver Architecture

Hasan, Mohd.;Arslan, Tughrul;Thompson, John S.
- ETRI Journal
- /
- v.29 no.1
- /
- pp.79-88
- /
- 2007
This paper proposes a novel concept of adjusting the hardware size in a multi-carrier code division multiple access (MC-CDMA) receiver in real time as per the channel parameters such as delay spread, signal-to-noise ratio, transmission rate, and Doppler frequency. The fast Fourier transform (FFT) or inverse FFT (IFFT) size in orthogonal frequency division multiplexing (OFDM)/MC-CDMA transceivers varies from 1024 points to 16 points. Two low-power reconfigurable radix-4 256-point FFT processor architectures are proposed that can also be dynamically configured as 64-point and 16-point as per the channel parameters to prove the concept. By tailoring the clock of the higher FFT stages for longer FFTs and switching to shorter FFTs from longer FFTs, significant power saving is achieved. In addition, two 256 sub-carrier MC-CDMA receiver architectures are proposed which can also be configured for 64 sub-carriers in real time to prove the feasibility of the concept over the whole receiver.
PDF

Design of RMESH Parallel Algorithms for Median Filters (Median 필터를 위한 RMESH 병렬 알고리즘의 설계)

Jeon, Byeong-Moon;Jeong, Chang-Sung
- The Transactions of the Korea Information Processing Society
- /
- v.5 no.11
- /
- pp.2845-2854
- /
- 1998
Median filter can be implemented in the binary domain based on threshold decomposition, stacking property, and linear separability. In this paper, we develop one-dimensional and two-dimensional parallel algorithms for the median filter on a reconfigurable mesh with buses(RMESH) which is suitable for VLSI implementation. And we evaluate their performance by comparing the time complexities of RMESH algorithms with those of algorithms on mesh-connected computer. When the length of M-valued 1-D signal is N and w is the window width, the RMESH algorithm is done in O(Mw) time and mesh algorithm is done in $O(Mw^2)$ time. Beside, when the size of M-valued 2-D image is $N{\times}N$ and the window size is $w{\times}w$, our algorithm on $N{\times}N$ RMESH can be computed in O(Mw) time which is a significant improvement over the $O(Mw^2)$ complexity on $N{\times}N$ mesh.
PDF

Evaluation of a Self-Adaptive Voltage Control Scheme for Low-Power FPGAs

Ishihara, Shota;Xia, Zhengfan;Hariyama, Masanori;Kameyama, Michitaka
- JSTS:Journal of Semiconductor Technology and Science
- /
- v.10 no.3
- /
- pp.165-175
- /
- 2010
This paper presents a fine-grain supply-voltage-control scheme for low-power FPGAs. The proposed supply-voltage-control scheme detects the critical path in real time with small overheads by exploiting features of asynchronous architectures. In an FPGA based on the proposed supply-voltage-control scheme, logic blocks on the sub-critical path are autonomously switched to a lower supply voltage to reduce the power consumption without system performance degradation. Moreover, in order to reduce the overheads of level shifters used at the power domain interface, a look-up-table without level shifters is employed. Because of the small overheads of the proposed supply-voltage-control scheme and the power domain interface, the granularity size of the power domain in the proposed FPGA is as fine as a single four-input logic block. The proposed FPGA is fabricated using the e-Shuttle 65 nm CMOS process. Correct operation of the proposed FPGA on the test chip is confirmed.
https://doi.org/10.5573/JSTS.2010.10.3.165 인용 PDF KSCI

A Novel Arithmetic Unit Over GF(2$^{m}$) for Reconfigurable Hardware Implementation of the Elliptic Curve Cryptographic Processor (타원곡선 암호프로세서의 재구성형 하드웨어 구현을 위한 GF(2$^{m}$)상의 새로운 연산기)

김창훈;권순학;홍춘표;유기영
- Journal of KIISE:Computer Systems and Theory
- /
- v.31 no.8
- /
- pp.453-464
- /
- 2004
In order to solve the well-known drawback of reduced flexibility that is associate with ASIC implementations, this paper proposes a novel arithmetic unit over GF(2$^{m}$ ) for field programmable gate arrays (FPGAs) implementations of elliptic curve cryptographic processor. The proposed arithmetic unit is based on the binary extended GCD algorithm and the MSB-first multiplication scheme, and designed as systolic architecture to remove global signals broadcasting. The proposed architecture can perform both division and multiplication in GF(2$^{m}$ ). In other word, when input data come in continuously, it produces division results at a rate of one per m clock cycles after an initial delay of 5m-2 in division mode and multiplication results at a rate of one per m clock cycles after an initial delay of 3m in multiplication mode respectively. Analysis shows that while previously proposed dividers have area complexity of Ο(m$^2$) or Ο(mㆍ(log$_2$$^{m}$ )), the Proposed architecture has area complexity of Ο(m), In addition, the proposed architecture has significantly less computational delay time compared with the divider which has area complexity of Ο(mㆍ(log$_2$$^{m}$ )). FPGA implementation results of the proposed arithmetic unit, in which Altera's EP2A70F1508C-7 was used as the target device, show that it ran at maximum 121MHz and utilized 52% of the chip area in GF(2$^{571}$ ). Therefore, when elliptic curve cryptographic processor is implemented on FPGAs, the proposed arithmetic unit is well suited for both division and multiplication circuit.
PDF KSCI

Hardware Approach to Fuzzy Inference―ASIC and RISC―

Watanabe, Hiroyuki
- Proceedings of the Korean Institute of Intelligent Systems Conference
- /
- 1993.06a
- /
- pp.975-976
- /
- 1993
This talk presents the overview of the author's research and development activities on fuzzy inference hardware. We involved it with two distinct approaches. The first approach is to use application specific integrated circuits (ASIC) technology. The fuzzy inference method is directly implemented in silicon. The second approach, which is in its preliminary stage, is to use more conventional microprocessor architecture. Here, we use a quantitative technique used by designer of reduced instruction set computer (RISC) to modify an architecture of a microprocessor. In the ASIC approach, we implemented the most widely used fuzzy inference mechanism directly on silicon. The mechanism is beaded on a max-min compositional rule of inference, and Mandami's method of fuzzy implication. The two VLSI fuzzy inference chips are designed, fabricated, and fully tested. Both used a full-custom CMOS technology. The second and more claborate chip was designed at the University of North Carolina(U C) in cooperation with MCNC. Both VLSI chips had muliple datapaths for rule digital fuzzy inference chips had multiple datapaths for rule evaluation, and they executed multiple fuzzy if-then rules in parallel. The AT & T chip is the first digital fuzzy inference chip in the world. It ran with a 20 MHz clock cycle and achieved an approximately 80.000 Fuzzy Logical inferences Per Second (FLIPS). It stored and executed 16 fuzzy if-then rules. Since it was designed as a proof of concept prototype chip, it had minimal amount of peripheral logic for system integration. UNC/MCNC chip consists of 688,131 transistors of which 476,160 are used for RAM memory. It ran with a 10 MHz clock cycle. The chip has a 3-staged pipeline and initiates a computation of new inference every 64 cycle. This chip achieved an approximately 160,000 FLIPS. The new architecture have the following important improvements from the AT & T chip: Programmable rule set memory (RAM). On-chip fuzzification operation by a table lookup method. On-chip defuzzification operation by a centroid method. Reconfigurable architecture for processing two rule formats. RAM/datapath redundancy for higher yield It can store and execute 51 if-then rule of the following format: IF A and B and C and D Then Do E, and Then Do F. With this format, the chip takes four inputs and produces two outputs. By software reconfiguration, it can store and execute 102 if-then rules of the following simpler format using the same datapath: IF A and B Then Do E. With this format the chip takes two inputs and produces one outputs. We have built two VME-bus board systems based on this chip for Oak Ridge National Laboratory (ORNL). The board is now installed in a robot at ORNL. Researchers uses this board for experiment in autonomous robot navigation. The Fuzzy Logic system board places the Fuzzy chip into a VMEbus environment. High level C language functions hide the operational details of the board from the applications programme . The programmer treats rule memories and fuzzification function memories as local structures passed as parameters to the C functions. ASIC fuzzy inference hardware is extremely fast, but they are limited in generality. Many aspects of the design are limited or fixed. We have proposed to designing a are limited or fixed. We have proposed to designing a fuzzy information processor as an application specific processor using a quantitative approach. The quantitative approach was developed by RISC designers. In effect, we are interested in evaluating the effectiveness of a specialized RISC processor for fuzzy information processing. As the first step, we measured the possible speed-up of a fuzzy inference program based on if-then rules by an introduction of specialized instructions, i.e., min and max instructions. The minimum and maximum operations are heavily used in fuzzy logic applications as fuzzy intersection and union. We performed measurements using a MIPS R3000 as a base micropro essor. The initial result is encouraging. We can achieve as high as a 2.5 increase in inference speed if the R3000 had min and max instructions. Also, they are useful for speeding up other fuzzy operations such as bounded product and bounded sum. The embedded processor's main task is to control some device or process. It usually runs a single or a embedded processer to create an embedded processor for fuzzy control is very effective. Table I shows the measured speed of the inference by a MIPS R3000 microprocessor, a fictitious MIPS R3000 microprocessor with min and max instructions, and a UNC/MCNC ASIC fuzzy inference chip. The software that used on microprocessors is a simulator of the ASIC chip. The first row is the computation time in seconds of 6000 inferences using 51 rules where each fuzzy set is represented by an array of 64 elements. The second row is the time required to perform a single inference. The last row is the fuzzy logical inferences per second (FLIPS) measured for ach device. There is a large gap in run time between the ASIC and software approaches even if we resort to a specialized fuzzy microprocessor. As for design time and cost, these two approaches represent two extremes. An ASIC approach is extremely expensive. It is, therefore, an important research topic to design a specialized computing architecture for fuzzy applications that falls between these two extremes both in run time and design time/cost. TABLEI INFERENCE TIME BY 51 RULES {{{{Time }}{{MIPS R3000 }}{{ASIC }}{{Regular }}{{With min/mix }}{{6000 inference 1 inference FLIPS }}{{125s 20.8ms 48 }}{{49s 8.2ms 122 }}{{0.0038s 6.4㎲ 156,250 }} }}
PDF

Search Result 8, Processing Time 0.022 seconds

이메일무단수집거부

이용약관

제 1 장 총칙

제 2 장 이용계약의 체결

제 3 장 계약 당사자의 의무

제 4 장 서비스의 이용

제 5 장 계약 해지 및 이용 제한

제 6 장 손해배상 및 기타사항

Detail Search

Image Search (β)