• Title/Summary/Keyword: multiprocessor systems

A Study on the Efficient Task Scheduling by the Reconstructed Task Graph (태스크 그래프의 재구성에 의한 효율적 태스크 스케줄링에 관한 연구)

  • Byun, Seung-Hwan;Yoo, Kwan-Jong
    • The Transactions of the Korea Information Processing Society / v.4 no.9 / pp.2235-2246 / 1997
  • This paper presents an effective heuristic task scheduling algorithm for multiprocessor systems. Task scheduling, defined as the allocation of m tasks onto n processors (m > n), involves several problems that are close to NP-hard and must be resolved for scheduling to be effective. The goal of task scheduling is either to obtain the minimum execution time by mapping the tasks onto a given system topology, or to reduce the total execution time while yielding a minimal system topology. To solve this problem, this paper schedules tasks by redefining the task graph as a reconstructed task graph (RTG). An RTG is obtained by merging or copying nodes so that the number of nodes on each level of the task graph equals the number of processors in the system topology; it is then scheduled directly onto the system topology. This method yields a short scheduling time, a simple scheduling procedure, and a near-optimal execution time without post-scheduling steps such as refinement or duplication.

Proposal and Performance Evaluation of A Scalable Scheduling Algorithm According to the Number of Parallel Processors (병렬 처리장치의 개수에 따른 스케줄링 알고리즘의 제안 및 성능평가)

  • Gyung-Leen Park;Sang Joon Lee;BongKyu Lee
    • Journal of Internet Computing and Services / v.1 no.2 / pp.19-28 / 2000
  • The scheduling problem in parallel processing systems has been a challenging research issue for decades. The problem is defined as finding an optimal schedule that minimizes the parallel execution time of an application on a target multiprocessor system. Duplication Based Scheduling (DBS) is a relatively new approach to solving the problem. DBS algorithms can reduce communication overhead by duplicating remote parent tasks on local processors. Most DBS algorithms assume that an unlimited number of processors is available in the system. Since this assumption may not hold in practice, the paper proposes a new scalable DBS algorithm for a target system with a limited number of processors. It is shown that the proposed algorithm with N available processors generates the same schedule as that obtained by the algorithm with an unlimited number of processors, where N is the number of input tasks. The performance evaluation also shows that the proposed algorithm degrades gracefully as the number of available processors in the system decreases.

Implementation and Performance Analysis of Efficient Packet Processing Method For DPI (Deep Packet Inspection) System using Dual-Processors (듀얼 프로세서 기반 DPI (Deep Packet Inspection) 엔진을 위한 효율적 패킷 프로세싱 방안 구현 및 성능 분석)

  • Yang, Joon-Ho;Han, Seung-Jae
    • The KIPS Transactions: Part C / v.16C no.4 / pp.417-422 / 2009
  • Implementing a DPI (Deep Packet Inspection) system on a general-purpose multiprocessor platform is attractive from a cost standpoint, since it does not require expensive customized hardware. Load balancing has been considered a primary means of achieving high performance in multiprocessor systems. We claim, however, that in DPI system design, simply balancing the load across processors does not necessarily yield the highest system performance. Instead, we propose a method in which tasks are allocated to processors based on their functions. We implemented the proposed method on a dual-processor Linux system and compared its performance with existing load balancing methods. Under the proposed method, one processor is dedicated to interrupt handling and generic packet processing, while the other is dedicated to DPI processing. According to experimental results, the proposed scheme outperforms the existing schemes by 60%, mainly because of reductions in cache misses and spin lock occurrences.

Energy-aware EDZL Real-Time Scheduling on Multicore Platforms (멀티코어 플랫폼에서 에너지 효율적 EDZL 실시간 스케줄링)

  • Han, Sangchul
    • Journal of KIISE / v.43 no.3 / pp.296-303 / 2016
  • Mobile real-time systems with limited system resources and a limited power source need to fully utilize the system resources when the workload is heavy and reduce energy consumption when the workload is light. EDZL (Earliest Deadline until Zero Laxity), a multiprocessor real-time scheduling algorithm, can provide high system utilization, but little work has been done aimed at reducing its energy consumption. This paper tackles the problem of DVFS (Dynamic Voltage/Frequency Scaling) in EDZL scheduling. It proposes a technique to compute a uniform speed on full-chip DVFS platforms and individual speeds of tasks on per-core DVFS platforms. This technique, which is based on the EDZL schedulability test, is a simple but effective one for determining the speeds of tasks offline. We also show through simulation that the proposed technique is useful in reducing energy consumption.
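
As a rough illustration of the EDZL rule named in the abstract above, the sketch below ranks jobs by earliest deadline but promotes any job whose laxity has reached zero. The names `Job`, `laxity`, and `edzl_pick` are hypothetical, and the `speed` parameter is an assumption that models DVFS as a uniform stretch of the remaining execution time; it does not reproduce the paper's offline speed computation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    deadline: float   # absolute deadline
    remaining: float  # remaining execution time at full speed


def laxity(job, now, speed=1.0):
    """Slack left before the job must run without interruption to meet its deadline."""
    # Assumption: running at a fraction `speed` of full speed stretches
    # the remaining execution time by 1 / speed.
    return job.deadline - now - job.remaining / speed


def edzl_pick(jobs, now, m, speed=1.0):
    """Pick up to m jobs: zero-laxity jobs first, then earliest deadline first."""
    ranked = sorted(jobs, key=lambda j: (laxity(j, now, speed) > 0, j.deadline))
    return ranked[:m]


jobs = [Job("A", 10, 2), Job("B", 12, 3), Job("C", 20, 20)]
chosen = edzl_pick(jobs, now=0, m=2)
print([j.name for j in chosen])
```

With two processors, plain EDF would pick A and B here; the zero-laxity rule selects C instead of B because C can no longer afford to wait.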

The Design of MPI Hardware Unit for Enhanced Broadcast Communication (효율적인 브로드캐스트 통신을 지원하는 MPI 하드웨어 유닛 설계)

  • Yun, Hee-Jun;Chung, Won-Young;Lee, Yong-Surk
    • The Journal of Korean Institute of Communications and Information Sciences / v.36 no.11B / pp.1329-1338 / 2011
  • This paper proposes an algorithm and hardware architecture for broadcast communication, which is the worst communication bottleneck in multiprocessors using distributed memory architectures. In conventional systems, collective communication is converted into point-to-point communications by the MPI library without considering the state of each processing node's communication port, which indicates whether the node is busy or free. If a conflicting point-to-point communication occurs during a broadcast, the transmission speed of the broadcast decreases. This paper therefore proposes an algorithm that determines the order of the point-to-point communications making up a broadcast according to the state of each processing node. The algorithm reduces total broadcast communication time by transmitting the message preferentially to processing nodes whose communication ports are free. The proposed MPI unit for broadcast communication is evaluated by modeling it in SystemC, and it improves broadcast communication performance by up to 78% with 16 nodes. This result shows that the proposed algorithm is useful for improving the overall performance of MPSoCs.
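
The ordering idea in the abstract above, sending first to nodes whose ports are free, can be sketched in software as follows. This is only an illustration under stated assumptions: the function name `broadcast_order` and the `port_busy` mapping are hypothetical, and the paper's actual design is a hardware unit whose details are not given here.

```python
def broadcast_order(root, port_busy):
    """Order the destinations of a broadcast from `root`, free ports first.

    `port_busy` maps a node id to True if its communication port is busy.
    """
    dests = [n for n in sorted(port_busy) if n != root]
    # sorted() is stable, so node-id order is kept within the free and busy groups.
    return sorted(dests, key=lambda n: port_busy[n])


ports = {0: False, 1: True, 2: False, 3: True}
print(broadcast_order(0, ports))
```

A real implementation would re-check port state before each send, since a busy port may become free while earlier transfers are in flight.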

A Design of Pipeline Chain Algorithm Based on Circuit Switching for MPI Broadcast Communication System (MPI 브로드캐스트 통신을 위한 서킷 스위칭 기반의 파이프라인 체인 알고리즘 설계)

  • Yun, Heejun;Chung, Wonyoung;Lee, Yong-Surk
    • The Journal of Korean Institute of Communications and Information Sciences / v.37B no.9 / pp.795-805 / 2012
  • This paper proposes an algorithm and a hardware architecture for broadcast communication, which is the worst communication bottleneck in multiprocessors using distributed memory architectures. In conventional systems, the pipelined broadcast algorithm takes advantage of the maximum bandwidth of the communication bus. However, because the pipelined broadcast sends the data divided into many parts, unnecessary synchronization processes are repeated. In this paper, we design an MPI unit for a pipeline chain algorithm based on circuit switching that removes the redundant synchronization, and we evaluate the proposed architecture by modeling it in SystemC. The proposed architecture improves broadcast communication performance by up to 3.3 times over systems using the conventional pipelined broadcast algorithm, and it can use almost the maximum bandwidth of the transmission bus. It was then implemented in Verilog HDL, synthesized with a TSMC 0.18 µm library, and implemented in a chip. The synthesized design occupies 4,700 gates (2-input NAND gate equivalents) and uses 2.4% of the total area. The proposed architecture thus improves the overall performance of an MPSoC while occupying a relatively small area.

Efficient Hardware Support: The Lock Mechanism without Retry (하드웨어 지원의 재시도 없는 잠금기법)

  • Kim Mee-Kyung;Hong Chul-Eui
    • Journal of the Korea Institute of Information and Communication Engineering / v.10 no.9 / pp.1582-1589 / 2006
  • A lock mechanism is essential for synchronization in multiprocessor systems. The conventional queuing lock generates two bus transactions for the lock-read: the initial attempt and the retry. This paper proposes a new locking protocol, called the WPV (Waiting Processor Variable) lock mechanism, which requires only one lock-read bus transaction. The WPV mechanism accesses the shared data in the initial lock-read phase, which is held in the pipelined protocol until the shared data is transferred. The WPV mechanism also uses the cache state lock mechanism to reduce locking overhead and guarantees FIFO lock operations under multiple lock contentions. We also derive analytical models of the WPV lock mechanism and of the conventional memory and cache queuing lock mechanisms. Simulation results show that the WPV lock mechanism reduces access time by about 50% compared with the conventional queuing lock mechanism.

Fault Diagnosis Using t/k-Diagnosable System in Hypercube Networks (t/k-진단 시스템을 사용한 하이퍼큐브 네트워크의 결함 진단)

  • Kim, Jang-Hwan;Rhee, Chung-Sei
    • The Journal of Korean Institute of Communications and Information Sciences / v.31 no.11C / pp.1044-1051 / 2006
  • System-level diagnosis algorithms use the properties of a t-diagnosable system, in which the maximum number of faults does not exceed t. Existing diagnosis algorithms are limited when dealing with large fault sets in large multiprocessor systems. Somani and Peleg proposed the t/k-diagnosable system, which diagnoses more than t faults by allowing a small, upper-bounded number of units to be diagnosed incorrectly. In this paper, we propose an adaptive hypercube diagnosis algorithm using the t/k-diagnosable system. When the number of faults exceeds t, we allow k faults to be diagnosed incorrectly. Simulation shows that the proposed algorithm performs better than Feng's HADA algorithm. We also propose a new algorithm that reduces test rounds by analyzing the syndrome of the RGC-ring obtained in the first step of the HADA/IHADA method. The proposed algorithm performs comparably to the HYP-DIAG algorithm.

Scalable CC-NUMA System using Repeater Node (리피터 노드를 이용한 Scalable CC-NUMA 시스템)

  • Kyoung, Jin-Mi;Jhang, Seong-Tae
    • Journal of KIISE: Computer Systems and Theory / v.29 no.9 / pp.503-513 / 2002
  • Since the CC-NUMA architecture has to access remote memory, the interconnection network determines the performance of a CC-NUMA system. The bus, which has been widely used as an interconnection network, has many limits in a large-scale system because of its limited physical scalability and bandwidth. The dual ring interconnection network, composed of high-speed point-to-point links, was designed to overcome these limits of the bus in large-scale systems. However, it has its own problem: the response latency increases rapidly when many nodes are attached to a snooping-based CC-NUMA system with the dual ring. In this paper, we propose a ring architecture with repeater nodes to overcome this problem of the dual ring on a snooping-based CC-NUMA system, and we design a repeater node adapted to this architecture. We also analyze the effects of the proposed architecture on system performance and response latency using a probability-driven simulator.

The PALM System: Architecture and Network Performance (PALM시스템의 구조와 네트웍 성능)

  • Kim, Suk-Il
    • The Transactions of the Korea Information Processing Society / v.1 no.1 / pp.105-113 / 1994
  • This paper introduces the Parallel Advanced Loosely coupled Multiprocessor (PALM) architecture, which is based on HCH(m,p), where m is the number of links per communication processor (CP) and p is the number of application processors (APs) connected to the CP. Communication links between a pair of CPs, or between a CP and an AP, are made of dual-port RAMs, which provide fast and reliable word-parallel communication between processors. Among the wide spectrum of HCH networks, HCH(m,2) is known to be a cost-optimal topology, in that it supports the largest number of APs with the minimal number of CPs and communication links. We also implement a testbed based on HCH(2,2). The experimental results show that the small communication/computation ratio of the PALM system would enable fine-grain parallelism on message-passing MIMD systems.