• Title/Summary/Keyword: MPSoC

Search Result 41, Processing Time 0.023 seconds

The Design of MPI Hardware Unit for Enhanced Broadcast Communication (효율적인 브로드캐스트 통신을 지원하는 MPI 하드웨어 유닛 설계)

  • Yun, Hee-Jun;Chung, Won-Young;Lee, Yong-Surk
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.36 no.11B
    • /
    • pp.1329-1338
    • /
    • 2011
  • This paper proposes an algorithm and hardware architecture for a broadcast communication which has the worst bottleneck among multiprocessor using distributed memory architectures. In conventional systems, collective communication is converted into point-to-point communications by MPI library cell without considering the state of communication port of each processing node which represents the processing node is in busy state or free state. If conflicting point-to-point communication occurs during broadcast communication, the transmitting speed for broadcast communication is decreased. Thus, this paper proposed an algorithm which determines the order of point-to-point communications for broadcast communication according to the state of each processing node. According to the state of each processing node, the proposed algorithm decreases total broadcast communication time by transmitting message preferentially to the processing node with communication port in free state. The proposed MPI unit for broadcast communication is evaluated by modeling it with systemC. In addition, it achieved a highly improved performance for broadcast communication up to 78% with 16 nodes. This result shows the proposed algorithm is useful to improving total performance of MPSoC.

A Design of Pipeline Chain Algorithm Based on Circuit Switching for MPI Broadcast Communication System (MPI 브로드캐스트 통신을 위한 서킷 스위칭 기반의 파이프라인 체인 알고리즘 설계)

  • Yun, Heejun;Chung, Wonyoung;Lee, Yong-Surk
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.37B no.9
    • /
    • pp.795-805
    • /
    • 2012
  • This paper proposes an algorithm and a hardware architecture for a broadcast communication which has the worst bottleneck among multiprocessor using distributed memory architectures. In conventional system, The pipelined broadcast algorithm is an algorithm which takes advantage of maximum bandwidth of communication bus. But unnecessary synchronization process are repeated, because the pipelined broadcast sends the data divided into many parts. In this paper, the MPI unit for pipeline chain algorithm based on circuit switching removing the redundancy of synchronization process was designed, the proposed architecture was evaluated by modeling it with systemC. Consequently, the performance of the proposed architecture was highly improved for broadcast communication up to 3.3 times that of systems using conventional pipelined broadcast algorithm, it can almost take advantage of the maximum bandwidth of transmission bus. Then, it was implemented with VerilogHDL, synthesized with TSMC 0.18um library and implemented into a chip. The area of synthesis results occupied 4,700 gates(2 input NAND gate) and utilization of total area is 2.4%. The proposed architecture achieves improvement in total performance of MPSoC occupying relatively small area.

An Efficient Hardware Implementation of 257-bit Point Scalar Multiplication for Binary Edwards Curves Cryptography (이진 에드워즈 곡선 공개키 암호를 위한 257-비트 점 스칼라 곱셈의 효율적인 하드웨어 구현)

  • Kim, Min-Ju;Jeong, Young-su;Shin, Kyung-Wook
    • Proceedings of the Korean Institute of Information and Commucation Sciences Conference
    • /
    • 2022.05a
    • /
    • pp.246-248
    • /
    • 2022
  • Binary Edwards curves (BEdC), a new form of elliptic curves proposed by Bernstein, satisfy the complete addition law without exceptions. This paper describes an efficient hardware implementation of point scalar multiplication on BEdC using projective coordinates. Modified Montgomery ladder algorithm was adopted for point scalar multiplication, and binary field arithmetic operations were implemented using 257-bit binary adder, 257-bit binary squarer, and 32-bit binary multiplier. The hardware operation of the BEdC crypto-core was verified using Zynq UltraScale+ MPSoC device. It takes 521,535 clock cycles to compute point scalar multiplication.

  • PDF

Voltage Selection Methodology for DVFS Overhead Minimization (동적 전압 주파수 스케일링 오버헤드 최소화를 위한 전압 선택 방법론)

  • Chang, Jin Kyu;Han, Tae Hee
    • Proceedings of the Korean Institute of Information and Commucation Sciences Conference
    • /
    • 2015.10a
    • /
    • pp.854-857
    • /
    • 2015
  • As the number of devices integrated on system-on-chip(SoC) increases exponentially, energy reduction technology is essential. Dynamic Voltage and Frequency Scaling (DVFS) is a very effective technique for reducing power consumption. Since it requires complex voltage regulators and PLL circuits, DVFS tends to have significant overheads. In this paper, we propose a new voltage selection algorithm to minimize transition overhead for multiprocessor SoC (MPSoC). Simulation results show that proposed algorithm appears less energy consumption with transition overhead even though maintains performance.

  • PDF

Software Pipeline-Based Partitioning Method with Trade-Off between Workload Balance and Communication Optimization

  • Huang, Kai;Xiu, Siwen;Yu, Min;Zhang, Xiaomeng;Yan, Rongjie;Yan, Xiaolang;Liu, Zhili
    • ETRI Journal
    • /
    • v.37 no.3
    • /
    • pp.562-572
    • /
    • 2015
  • For a multiprocessor System-on-Chip (MPSoC) to achieve high performance via parallelism, we must consider how to partition a given application into different components and map the components onto multiple processors. In this paper, we propose a software pipeline-based partitioning method with cyclic dependent task management and communication optimization. During task partitioning, simultaneously considering computation load balance and communication optimization can cause interference, which leads to performance loss. To address this issue, we formulate their constraints and apply an integer linear programming approach to find an optimal partitioning result - one that requires a trade-off between these two factors. Experimental results on a reconfigurable MPSoC platform demonstrate the effectiveness of the proposed method, with 20% to 40% performance improvements compared to a traditional software pipeline-based partitioning method.

A Design of 256-bit Modular Multiplier using 3-way Toom-Cook Multiplication Algorithm and Fast Reduction Algorithm (3-way Toom-Cook 곱셈 알고리듬과 고속 축약 알고리듬을 이용한 256-비트 모듈러 곱셈기 설계)

  • Yang, Hyeon-Jun;Shin, Kyung-Wook
    • Proceedings of the Korean Institute of Information and Commucation Sciences Conference
    • /
    • 2021.10a
    • /
    • pp.223-225
    • /
    • 2021
  • Modular multiplication is a key operation for point scalar multiplication of ECC, and is the most important factor affecting the performance of ECC processor. This paper describes a design of a 256-bit modular multiplier that adopts 3-way Toom-Cook multiplication algorithm and modified fast reduction algorithm. One 90-bit multiplier and three 264-bit adders were used to optimize the hardware size and the number of clock cycles required. The modular multiplier was verified by implementing it using Zynq UltraScale+ MPSoC device and the modular multiplication operation takes 15 clock cycles.

  • PDF

Design of Message Passing Engine Based on Processing Node Status for MPI Collective Communication (MPI 집합통신을 위한 프로세싱 노드 상태 기반의 메시지 전달 엔진 설계)

  • Chung, Won-Young;Lee, Yong-Surk
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.37 no.8B
    • /
    • pp.668-676
    • /
    • 2012
  • In this paper, on the assumption that MPI collective communication function is converted into a group of point-to-point communication functions in the transaction level, an algorithm that optimizes broadcast, scatter and gather function among MPI collective communication is proposed. The MPI hardware engine that operates the proposed algorithm was designed, and it was named the OCC-MPE (Optimized Collective Communication Message Passing Engine). The OCC-MPE operates point-to-point communication by using the standard send mode. The transmission order is arranged according to the algorithm that proposes the most frequently used broadcast, scatter and gather functions among the collective communications, so the whole communication time is reduced. To measure the performance of the proposed algorithm, the OCC-MPE with the Bus Functional Model (BFM) based on SystemC was designed. After evaluating the performance through the BFM based on SystemC, the proposed OCC-MPE is designed by using VerilogHDL. As a result of synthesizing with the TSMC $0.18{\mu}m$, the gate count of each OCC-MPE is approximately 1978.95 with four processing nodes. That occupies approximately 4.15% in the whole system, which means it takes up a relatively small amount. Improved performance is expected with relatively small amounts of area increase if the OCC-MPE operated by the proposed algorithm is added to the MPSoC (Multi-Processor System on a Chip).

Separated Address/Data Network Design for Bus Protocol compatible Network-on-Chip (버스 프로토콜 호환 가능한 네트워크-온-칩에서의 분리된 주소/데이터 네트워크 설계)

  • Chung, Seungh Ah;Lee, Jae Hoon;Kim, Sang Heon;Lee, Jae Sung;Han, Tae Hee
    • Journal of the Institute of Electronics and Information Engineers
    • /
    • v.53 no.4
    • /
    • pp.68-75
    • /
    • 2016
  • As the number of cores and IPs increase in multiprocessor system-on-chip (MPSoC), network-on-chip (NoC) has emerged as a promising novel interconnection architecture for its parallelism and scalability. However, minimization of the latency in NoC with legacy bus IPs must be addressed. In this paper, we focus on the latency minimization problem in NoC which accommodates legacy bus protocol based IPs considering the trade-offs between hop counts and path collisions. To resolve this problem, we propose separated address/data network for independent address and data phases of bus protocol. Compared to Mesh and irregular topologies generated by TopGen, experimental results show that average latency and execution time are reduced by 19.46% and 10.55%, respectively.

A Deadlock Free Router Design for Network-on-Chip Architecture (NOC 구조용 교착상태 없는 라우터 설계)

  • Agarwal, Ankur;Mustafa, Mehmet;Shiuku, Ravi;Pandya, A.S.;Lho, Young-Ugh
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.11 no.4
    • /
    • pp.696-706
    • /
    • 2007
  • Multiprocessor system on chip (MPSoC) platform has set a new innovative trend for the System on Chip (SoC) design. With the rapidly approaching billion transistors era, some of the main problem in deep sub-micron technologies characterized by gate lengths in the range of 60-90 nm will arise from non scalable wire delays, errors in signal integrity and un-synchronized communication. These problems may be addressed by the use of Network on Chip (NOC) architecture for future SoC. Most future SoCs will use network architecture and a packet based communication protocol for on chip communication. This paper presents an adaptive wormhole routing with proactive turn prohibition to guarantee deadlock free on chip communication for NOC architecture. It shows a simple muting architecture with five full-duplex, flit-wide communication channels. We provide simulation results for message latency and compare results with those of dimension ordered techniques operating at the same link rates.

A Study on the Parallel Routing in Hybrid Optical Networks-on-Chip (하이브리드 광학 네트워크-온-칩에서 병렬 라우팅에 관한 연구)

  • Seo, Jung-Tack;Hwang, Yong-Joong;Han, Tae-Hee
    • Journal of the Institute of Electronics Engineers of Korea SD
    • /
    • v.48 no.8
    • /
    • pp.25-32
    • /
    • 2011
  • Networks-on-chip (NoC) is emerging as a key technology to overcome severe bus traffics in ever-increasing complexity of the Multiprocessor systems-on-chip (MPSoC); however traditional electrical interconnection based NoC architecture would be faced with technical limits of bandwidth and power consumptions in the near future. In order to cope with these problems, a hybrid optical NoC architecture which use both electrical interconnects and optical interconnects together, has been widely investigated. In the hybrid optical NoCs, wormhole switching and simple deterministic X-Y routing are used for the electrical interconnections which is responsible for the setup of routing path and optical router to transmit optical data through optical interconnects. Optical NoC uses circuit switching method to send payload data by preset paths and routers. However, conventional hybrid optical NoC has a drawback that concurrent transmissions are not allowed. Therefore, performance improvement is limited. In this paper, we propose a new routing algorithm that uses circuit switching and adaptive algorithm for the electrical interconnections to transmit data using multiple paths simultaneously. We also propose an efficient method to prevent livelock problems. Experimental results show up to 60% throughput improvement compared to a hybrid optical NoC and 65% power reduction compared to an electrical NoC.