Fault-Tolerance Improvement of Real-Time Embedded System using Static Checkpointing

A Static Checkpointing Scheme for Improving the Fault Tolerance of Real-Time Embedded Systems

  • 유상문 (School of Electronics and Information Engineering, Kunsan National University)
  • Published: 2007.12.01

Abstract

This paper presents a scheme for improving the fault tolerance of real-time embedded systems that uses equidistant checkpointing to tolerate transient errors. Transient errors are caused by transient faults, the most significant type of fault in reliable computer systems. Transient faults are assumed to occur according to a Poisson process and to be detected in a non-concurrent manner (e.g., by periodic checks). The probability that a real-time task completes successfully in the presence of transient errors is derived, taking the possible effects of those errors into account. Based on this probability, the paper introduces a condition under which inserting checkpoints improves the fault tolerance of the system and presents an optimal equidistant checkpointing strategy that achieves the highest fault tolerance.
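
The trade-off described in the abstract can be illustrated with a minimal sketch. It is not the paper's derivation: it assumes a simplified model in which the task is split into `m` equal segments, each segment ends with a check of cost `c`, an error is detected only at that check, and a failed segment is re-executed after a rollback of cost `r`. The function names and all parameters (`T`, `D`, `lam`, `m`, `c`, `r`) are hypothetical and chosen only for illustration.

```python
import math


def completion_probability(T, D, lam, m, c=0.0, r=0.0):
    """Probability of meeting deadline D under a simplified checkpoint model.

    Assumptions (illustrative, not taken from the paper):
      - the task needs T time units of fault-free execution,
      - it is split into m equal segments, each followed by a check of cost c,
      - transient faults arrive as a Poisson process with rate lam,
      - a fault is detected only at the check that ends its segment,
      - a failed segment is rolled back (cost r) and re-executed,
      - faults during the rollback itself are ignored.
    """
    seg = T / m + c                 # duration of one segment attempt
    retry = seg + r                 # extra time consumed by one failed attempt
    slack = D - m * seg             # time left over for re-executions
    if slack < 0:
        return 0.0                  # the deadline cannot be met even without faults
    p = math.exp(-lam * seg)        # one attempt sees no fault (Poisson arrivals)
    q = 1.0 - p
    # Largest number of failed attempts that still fits before the deadline.
    # The small epsilon guards against floating-point rounding in the division.
    max_retries = math.floor(slack / retry + 1e-9)
    # Negative binomial: k failures occur before the m-th successful segment.
    return sum(math.comb(m + k - 1, k) * p**m * q**k
               for k in range(max_retries + 1))


def best_segment_count(T, D, lam, c=0.0, r=0.0, m_max=50):
    """Exhaustively pick the segment count in 1..m_max with the highest probability."""
    return max(range(1, m_max + 1),
               key=lambda m: completion_probability(T, D, lam, m, c, r))


if __name__ == "__main__":
    T, D, lam, c = 10.0, 20.0, 0.1, 0.1
    for m in (1, 2, 4, 8):
        print(m, completion_probability(T, D, lam, m, c))
    print("best m:", best_segment_count(T, D, lam, c))
```

In this model, more segments shorten each re-execution but add check overhead that consumes slack, so the completion probability is not monotonic in `m`. That is the kind of condition and optimum the abstract refers to, although the paper's own model and result may differ.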
