• Title/Summary/Keyword: Checkpointing (체크포인팅)

Checkpoint-based Job Migration Technique in Mobile Grids (모바일 그리드에서 체크포인트 기반 작업 이주 기법)

  • Jung, Dae-Yong;Suh, Tae-Weon;Chung, Kwang-Sik;Yu, Heon-Chang
    • The Journal of Korean Association of Computer Education / v.12 no.4 / pp.47-55 / 2009
  • Many studies consider mobile devices as resources in mobile grids. However, mobile devices have limitations in wireless connectivity and battery capacity, so grid operations that use mobile devices are less reliable and less efficient than those in fixed grid environments. In this paper, we propose a job migration scheme for mobile devices that overcomes these limitations. The proposed scheme predicts failure conditions during execution and takes checkpoints. If a failure then occurs on a mobile device during execution, the running job can be migrated to another mobile device using the checkpoint information. To implement the scheme, we place a mobile device manager on a proxy server and a status manager on each mobile device. The two managers monitor the connection state, wireless signal strength, and battery capacity of the mobile devices. Simulation results show improved efficiency and reliability during execution.
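
The abstract does not give the prediction rule, so the following is only a minimal sketch of how a status manager might flag an impending failure and how a device manager might choose a migration target. The names `DeviceStatus`, `failure_predicted`, and `pick_migration_target`, the threshold values, and the "most battery remaining" selection rule are all assumptions made for illustration, not the authors' method.

```python
from dataclasses import dataclass


@dataclass
class DeviceStatus:
    """Hypothetical status record reported by a device's status manager."""
    connected: bool
    signal_strength: float  # assumed normalized to 0.0..1.0
    battery_level: float    # assumed normalized to 0.0..1.0


def failure_predicted(status, min_signal=0.3, min_battery=0.2):
    """Return True when a checkpoint should be taken (thresholds are assumptions)."""
    return (
        not status.connected
        or status.signal_strength < min_signal
        or status.battery_level < min_battery
    )


def pick_migration_target(devices):
    """Pick the healthy device with the most battery left, or None if there is none.

    `devices` maps a device name to its DeviceStatus. The selection rule is an
    illustrative assumption.
    """
    healthy = {name: s for name, s in devices.items() if not failure_predicted(s)}
    if not healthy:
        return None
    return max(healthy, key=lambda name: healthy[name].battery_level)
```

In this sketch, a job running on a device for which `failure_predicted` returns True would be checkpointed, and the checkpoint would be shipped to the device returned by `pick_migration_target` if the failure actually occurs.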

Remote Logging for Fault-Tolerant Software Distributed Shared Memory (소프트웨어 분산공유메모리의 고장 허용을 위한 원격 로깅 기법)

  • 박소연;김영재;맹승렬
    • Proceedings of the Korean Information Science Society Conference / 2003.04a / pp.70-72 / 2003
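  • As the performance of software distributed shared memory systems has improved, they are increasingly being used on large-scale clusters. However, the likelihood of failures has grown along with system scale. To increase system availability, distributed shared memory systems that provide fault tolerance have been called for, and much research has been done on message logging alongside checkpointing. In this paper, we propose a method that uses a high-speed network to log to the memory of a remote node, together with a recovery method, and demonstrate its performance through an implementation. Because remote logging does not require disk access, its overhead is low, and it tolerates failures of multiple nodes to a limited extent.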

Performance Evaluation and Optimization of Journaling File Systems with Multicores and High-Performance Flash SSDs (멀티코어 및 고성능 플래시 SSD 환경에서 저널링 파일 시스템의 성능 평가 및 최적화)

  • Han, Hyuck
    • The Journal of the Korea Contents Association / v.18 no.4 / pp.178-185 / 2018
  • Recently, demand for computer systems with multicore CPUs and high-performance flash-based storage devices (i.e., flash SSDs) has grown rapidly in cloud computing, supercomputing, and enterprise storage/database systems. Journaling file systems running on such high-performance systems do not exploit the full I/O bandwidth of high-performance SSDs. In this article, we evaluate and analyze the performance of the Linux EXT4 file system with high-performance SSDs and multicore CPUs. The system used in this study has 72 cores and an Intel NVMe SSD, and the flash SSD delivers up to 2800/1900 MB/s for sequential read/write operations. Our experimental results show that checkpointing in the EXT4 file system is a major overhead. Furthermore, we optimize the checkpointing procedure, and our optimized EXT4 file system shows up to 92% better performance than the original EXT4 file system.

A Relative Performance Index-based Job Migration in Grid Computing Environment (그리드 컴퓨팅 환경에서의 상대성능지수에 기반한 작업 이주)

  • Kim Young-Gyun;Oh Gil-Ho;Cho Kum Won;Ko Soon-Heum
    • Journal of KIISE:Computing Practices and Letters / v.11 no.4 / pp.293-304 / 2005
  • In this paper, we study job migration in a grid computing environment using Cactus and MPICH-G2 based on Globus. Our approach performs job migration by finding a site with ample computational resources that would decrease execution time. The Migration Manager recovers the job from the checkpoint files and restarts it on the target site. To select a migration site, the proposed method considers the system's performance index, the CPU load, the network traffic needed to send the migration job files, and the execution time predicted on the migration site. It then selects the site with the maximum performance gain. By selecting a site with minimum migration time and minimum execution time, this approach implements a more efficient grid computing environment. The proposed method is validated by effectively decreasing total execution time on K*Grid.

An Adaptive Checkpointing Scheme for Fault Tolerance of Real-Time Control Systems with Concurrent Fault Detection (동시 결함 검출 기능이 있는 실시간 제어 시스템의 결함 허용성을 위한 적응형 체크포인팅 기법)

  • Ryu, Sang-Moon
    • Journal of Institute of Control, Robotics and Systems / v.17 no.1 / pp.72-77 / 2011
  • Checkpointing is a well-known technique for coping with transient faults in digital systems. This paper proposes an adaptive checkpointing scheme that improves the reliability of real-time control systems with concurrent fault detection capability. With concurrent fault detection, the effects of transient faults are assumed to be detected with no latency. The proposed adaptive checkpointing scheme is based on the reliability analysis of an equidistant checkpointing scheme. Numerical data show that the proposed adaptive scheme outperforms the equidistant scheme from a reliability point of view.

An Adaptive Checkpointing Scheme for Fault Tolerance of Real-Time Control Systems (실시간 제어 시스템의 결함 허용성을 위한 적응형 체크포인팅 기법)

  • Ryu, Sang-Moon
    • Journal of Institute of Control, Robotics and Systems / v.15 no.6 / pp.598-603 / 2009
  • Checkpointing is a well-known technique for coping with transient faults in digital systems. This paper proposes an adaptive checkpointing scheme that improves the reliability of real-time control systems. The proposed adaptive checkpointing scheme builds on previous work on the reliability of an equidistant checkpointing scheme. To derive the adaptive scheme, we introduce conditions that must be satisfied for an equidistant checkpointing scheme to improve reliability. Numerical data show that the proposed adaptive scheme outperforms the equidistant scheme from a reliability point of view.

An Application-Level Fault Tolerant System For Synchronous Parallel Computation (동기 병렬연산을 위한 응용수준의 결함 내성 연산시스템)

  • Park, Pil-Seong
    • Journal of Internet Computing and Services / v.9 no.5 / pp.185-193 / 2008
  • The MTBF (mean time between failures) of large-scale parallel systems is known to be only on the order of several hours, and large computations sometimes waste a huge amount of CPU time. However, MPI (Message Passing Interface), a de facto standard for message-passing parallel programming, offers no means of handling this problem. In this paper, we propose an application-level fault-tolerant computation system, built purely on the current MPI standard without any non-standard fault-tolerant MPI library, that can be used for general scientific synchronous parallel computation.

Performance Analysis of Checkpointing and Dual Modular Redundancy for Fault Tolerance of Real-Time Control System (실시간 제어 시스템의 결함 극복을 위한 이중화 구조와 체크포인팅 기법의 성능 분석)

  • Ryu, Sang-Moon
    • Journal of Institute of Control, Robotics and Systems / v.14 no.4 / pp.376-380 / 2008
  • This paper presents a performance analysis of real-time control systems that employ DMR (dual modular redundancy) to detect transient errors and a checkpointing technique to tolerate them. Transient errors are caused by transient faults and are the most significant type of error in reliable computer systems. Transient faults are assumed to occur according to a Poisson process and to be detected by a dual modular redundant structure. In addition, an equidistant checkpointing strategy is considered. We derive the probability of successful task completion in a real-time control system where periodic checkpointing operations are performed during the execution of a real-time control task. Numerical examples show how the checkpointing scheme influences the probability of task completion. In addition, the result of the analysis is compared with the simulation result.

Fault-Tolerance Improvement of Real-Time Embedded System using Static Checkpointing (실시간 임베디드 시스템의 결함 허용성 개선을 위한 정적 체크포인팅 방안)

  • Ryu, Sang-Moon
    • Journal of Institute of Control, Robotics and Systems / v.13 no.12 / pp.1147-1152 / 2007
  • This paper presents a scheme for improving the fault tolerance of real-time embedded systems that employs an equidistant checkpointing technique to tolerate transient errors. Transient errors are caused by transient faults, which are the most significant type of fault in reliable computer systems. Transient faults are assumed to occur according to a Poisson process and to be detected in a non-concurrent manner (e.g., checked periodically). The probability of successful real-time task completion in the presence of transient errors is derived, taking into account the possible effects of the transient errors. Based on this, we introduce a condition under which inserting checkpoints improves the fault tolerance of the system, and present an optimal equidistant checkpointing strategy that achieves the highest fault tolerance.
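
The trade-off behind equidistant checkpointing can be illustrated with a deliberately simplified model; this is a sketch under stated assumptions, not the derivation in the paper. Assume the task needs `work` time units of fault-free computation split into `n_segments` equal segments, each followed by a checkpoint costing `checkpoint_cost`; faults arrive as a Poisson process with rate `fault_rate`; a fault is detected only at the end of a segment and forces that segment to be re-executed; and the task succeeds if all segments complete by `deadline`. All function and parameter names are hypothetical.

```python
import math


def completion_probability(work, deadline, fault_rate, n_segments, checkpoint_cost):
    """P(task finishes by the deadline) under the simplified model described above."""
    # One attempt = compute one segment, then checkpoint (assumed model).
    attempt_time = work / n_segments + checkpoint_cost
    # Probability that no Poisson fault arrives during a single attempt.
    p_ok = math.exp(-fault_rate * attempt_time)
    # Number of attempts that fit before the deadline.
    max_attempts = int(deadline // attempt_time)
    if max_attempts < n_segments:
        return 0.0
    # Success means at least n_segments fault-free attempts among max_attempts
    # independent attempts, i.e. a binomial tail probability.
    return sum(
        math.comb(max_attempts, k) * p_ok**k * (1 - p_ok) ** (max_attempts - k)
        for k in range(n_segments, max_attempts + 1)
    )


def best_segment_count(work, deadline, fault_rate, checkpoint_cost, max_segments=50):
    """Search for the segment count that maximizes the completion probability."""
    return max(
        range(1, max_segments + 1),
        key=lambda n: completion_probability(
            work, deadline, fault_rate, n, checkpoint_cost
        ),
    )
```

Calling `best_segment_count(8, 16, 0.1, 0.5)` searches this toy model for the number of equal segments with the highest completion probability. More checkpoints shorten each re-execution but add overhead, which is why an optimum exists.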

Power-aware Real-time Task Scheduling in Dependable Embedded Systems (신뢰도를 요구하는 임베디드 시스템에서의 저전력 태스크 스케쥴링)

  • Kim, Kyong Hoon;Kim, Yuna;Kim, Jong
    • IEMEK Journal of Embedded Systems and Applications / v.3 no.1 / pp.25-29 / 2008
  • In this paper, we provide an adaptive power-aware checkpointing scheme for fixed priority-based DVS scheduling in dependable real-time systems. In the provided scheme, we analyze the minimum number of tolerable faults of a task and the optimal checkpointing interval in order to meet the deadline and guarantee its specified reliability. The energy-efficient voltage level at a fault arrival is also analyzed and used in the recovery of the faulty task.
