Low-Cost Causal Message Logging based Recovery Algorithm Considering Asynchronous Checkpointing

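  • 안진호 (Department of Computer Science, Division of Information Science, Kyonggi University) ;
  • 방승준 (Department of Computer Science, Division of Information Science, Kyonggi University)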
  • Published : 2006.12.31

Abstract

Compared with previous recovery algorithms for causal message logging, Elnozahy's recovery algorithm considerably reduces the number of stable storage accesses and allows live processes to continue their computations while recovery is in progress. However, when causal message logging is combined with asynchronous checkpointing, the system state may be inconsistent after this algorithm runs if failures occur concurrently. In this paper, we show these inconsistent cases and propose a low-cost recovery algorithm for causal message logging that solves the problem. To ensure system consistency, the algorithm allows the recovery leader to obtain recovery information not only from the live processes but also from the other recovering processes. The proposed algorithm requires no extra messages compared with Elnozahy's, and the additional overhead incurred by message piggybacking is low. Simulation results show that the proposed algorithm increases the recovery information collection time by only about 1.0% to 2.1% compared with Elnozahy's algorithm.
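The central idea in the abstract, that the recovery leader gathers recovery information from recovering processes as well as live ones, can be illustrated with a toy model. The sketch below is not the paper's algorithm. It assumes that recovery information consists of determinants (records of message delivery order) and that a recovering process may still hold determinants restored from its own checkpoint. The names `Determinant`, `Process`, `collect_recovery_info`, and `include_recovering` are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass(frozen=True)
class Determinant:
    """Hypothetical record of one message delivery."""
    receiver: str  # process that delivered the message
    sender: str    # process that sent it
    ssn: int       # send sequence number at the sender
    rsn: int       # receive sequence number at the receiver


@dataclass
class Process:
    name: str
    alive: bool
    # Determinants this process can supply: its volatile log if it is live,
    # or whatever it restored from its checkpoint if it is recovering.
    known: Set[Determinant] = field(default_factory=set)


def collect_recovery_info(
    processes: List[Process], include_recovering: bool = True
) -> Dict[str, List[Determinant]]:
    """Merge, per failed process, the determinants of its past deliveries.

    With include_recovering=False only live processes are queried.
    With include_recovering=True the other recovering processes are
    queried as well, as the abstract describes.
    """
    failed = {p.name for p in processes if not p.alive}
    merged: Dict[str, Set[Determinant]] = {name: set() for name in failed}
    for p in processes:
        if not p.alive and not include_recovering:
            continue  # skip recovering processes in the live-only variant
        for d in p.known:
            if d.receiver in failed:
                merged[d.receiver].add(d)
    # Order each process's determinants by delivery order for replay.
    return {n: sorted(ds, key=lambda d: d.rsn) for n, ds in merged.items()}


# p1 and p2 fail concurrently; p3 stays alive.
d1 = Determinant(receiver="p1", sender="p3", ssn=1, rsn=1)
d2 = Determinant(receiver="p1", sender="p2", ssn=1, rsn=2)
procs = [
    Process("p1", alive=False),
    Process("p2", alive=False, known={d2}),  # d2 survives only in p2's checkpoint
    Process("p3", alive=True, known={d1}),
]
live_only = collect_recovery_info(procs, include_recovering=False)
with_peers = collect_recovery_info(procs, include_recovering=True)
```

In this toy scenario, querying only live processes leaves `p1` without the determinant `d2`, while also querying the recovering process `p2` recovers it. This mirrors, in a simplified form, why concurrent failures can leave the system inconsistent when only live processes are consulted.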

References

  1. L. Alvisi, B. Hoppe and K. Marzullo, 'Nonblocking and Orphan-Free Message Logging Protocols,' In Proc. of the 23rd Symposium on Fault-Tolerant Computing, pp.145-154, 1993 https://doi.org/10.1109/FTCS.1993.627318
  2. R. Bagrodia, R. Meyer, M. Takai, Y. Chen, X. Zeng, J. Martin and H. Y. Song, 'Parsec: A Parallel Simulation Environment for Complex Systems,' IEEE Computer, pp.77-85, 1998 https://doi.org/10.1109/2.722293
  3. K. Bhatia, K. Marzullo and L. Alvisi, 'Scalable Causal Message Logging for Wide-Area Environments,' Concurrency and Computation: Practice and Experience, Vol.15, No.3, pp.873-889, August, 2003 https://doi.org/10.1002/cpe.737
  4. A. Bouteiller, F. Cappello, T. Herault, G. Krawezik, P. Lemarinier and F. Magniette, 'MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging,' In Proc. of the 15th International Conference on High Performance Networking and Computing (SC2003), November, 2003 https://doi.org/10.1109/SC.2003.10027
  5. K. M. Chandy and L. Lamport, 'Distributed Snapshots: Determining Global States of Distributed Systems,' ACM Transactions on Computer Systems, Vol.3, No.1, pp.63-75, 1985 https://doi.org/10.1145/214451.214456
  6. E. N. Elnozahy, 'On the Relevance of Communication Costs of Rollback-Recovery Protocols,' In Proc. of the 14th ACM Symposium on Principles of Distributed Computing, pp.74-79, 1995 https://doi.org/10.1145/224964.224973
  7. E. N. Elnozahy, L. Alvisi, Y. M. Wang and D. B. Johnson, 'A Survey of Rollback-Recovery Protocols in Message-Passing Systems,' ACM Computing Surveys, Vol.34, No.3, pp.375-408, 2002 https://doi.org/10.1145/568522.568525
  8. E. N. Elnozahy and W. Zwaenepoel, 'Manetho: Transparent rollback-recovery with low overhead, limited rollback and fast output commit,' IEEE Transactions on Computers, Vol.41, pp.526-531, 1992 https://doi.org/10.1109/12.142678
  9. B. Lee, T. Park, H. Y. Yeom and Y. Cho, 'On the impossibility of non-blocking consistent causal recovery,' IEICE Transactions on Information and Systems, Vol.E83-D, No.2, pp.291-294, 2000
  10. J. R. Mitchell and V. Garg, 'A non-blocking recovery algorithm for causal message logging,' In Proc. of the 17th Symposium on Reliable Distributed Systems, pp.3-9, 1998 https://doi.org/10.1109/RELDIS.1998.740468
  11. M. L. Powell and D. L. Presotto, 'Publishing: A reliable broadcast communication mechanism,' In Proc. of the 9th International Symposium on Operating Systems Principles, pp.100-109, 1983 https://doi.org/10.1145/800217.806618
  12. R. D. Schlichting and F. B. Schneider, 'Fail-stop processors: an approach to designing fault-tolerant distributed computing systems,' ACM Transactions on Computer Systems, Vol.1, pp.222-238, 1983 https://doi.org/10.1145/357369.357371
  13. R. E. Strom and S. Yemini, 'Optimistic recovery in distributed systems,' ACM Transactions on Computer Systems, Vol.3, pp.204-226, 1985 https://doi.org/10.1145/3959.3962
  14. S. Venkatesan, 'Optimistic crash recovery without changing application messages,' IEEE Transactions on Parallel and Distributed Systems, Vol.8, pp.263-271, 1997 https://doi.org/10.1109/71.584092
  15. B. Yao, K. F. Ssu and W. K. Fuchs, 'Message Logging in Mobile Computing,' In Proc. of the 29th International Symposium on Fault Tolerant Computing, pp.14-19, 1999 https://doi.org/10.1109/FTCS.1999.781064