References
- G. Aupy, A. Benoit, T. Herault, Y. Robert, F. Vivien, and D. Zaidouni, On the combination of silent error detection and checkpointing, The 19th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC), 2013.
- L. Bautista-Gomez, A. Benoit, A. Cavelan, S. K. Raina, Y. Robert, and H. Sun, Coping with recall and precision of soft error detectors, J. Parallel and Distributed Computing 98 (2016), 8-24. https://doi.org/10.1016/j.jpdc.2016.07.007
- A. Benoit, A. Cavelan, F. Cappello, P. Raghavan, Y. Robert, and H. Sun, Coping with silent and fail-stop errors at scale by combining replication and checkpointing, J. Parallel and Distributed Computing 122 (2018), 209-225. https://doi.org/10.1016/j.jpdc.2018.08.002
- A. Benoit, A. Cavelan, F. Ciorba, V. Le Fevre, and Y. Robert, Combining checkpointing and replication for reliable execution of linear workflows with fail-stop and silent errors, International Journal of Networking and Computing, 9 (2019), no. 1, 2-27. https://doi.org/10.15803/ijnc.9.1_2
- A. Benoit, A. Cavelan, Y. Robert, and H. Sun, Assessing general-purpose algorithms to cope with fail-stop and silent errors, ACM Transactions on Parallel Computing 3 (2016), no. 2, 1-36.
- A. Benoit, A. Cavelan, Y. Robert, and H. Sun, Optimal resilience patterns to cope with fail-stop and silent errors, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2016.
- A. Benoit, A. Cavelan, Y. Robert, and H. Sun, Multi-level checkpointing and silent error detection for linear workflows, J. Comput. Sci. 28 (2018), 398-415. https://doi.org/10.1016/j.jocs.2017.03.024
- A. Benoit, S. K. Raina, and Y. Robert, Efficient checkpoint/verification patterns, International Journal of High Performance Computing Applications 30 (2016), no. 1, 52-65.
- M. S. Bougeret, H. Casanova, M. Rabie, Y. Robert, and F. Vivien, Checkpointing strategies for parallel jobs, International Conference for High Performance Computing, Networking, Storage and Analysis, United States, 1-11, 2011.
- M. S. Bouguerra, T. Gautier, D. Trystram, and J. M. Vincent, A flexible checkpoint/restart model in distributed systems, International Conference on Parallel Processing and Applied Mathematics (PPAM), LNCS, 6067 (2010), 206-215.
- M. S. Bouguerra, D. Trystram, and F. Wagner, Complexity analysis of checkpoint scheduling with variable costs, IEEE Trans. Comput. 62 (2013), no. 6, 1269-1275. https://doi.org/10.1109/TC.2012.57
- K. M. Chandy and L. Lamport, Determining global states of distributed systems, ACM Transactions on Computer Systems, 3 (1985), no. 1, 63-75. https://doi.org/10.1145/214451.214456
- J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems (FGCS) 22 (2006), no. 3, 303-312. https://doi.org/10.1016/j.future.2004.11.016
- E. Elnozahy and J. Plank, Checkpointing for peta-scale systems: a look into the future of practical rollback-recovery, IEEE Transactions on Dependable and Secure Computing, 1 (2004), no. 2, 97-108. https://doi.org/10.1109/TDSC.2004.15
- Y. Ling, J. Mi, and X. Lin, A variational calculus approach to optimal checkpoint placement, IEEE Trans. Comput. 50 (2001), no. 7, 699-708. https://doi.org/10.1109/12.936236
- G. Lu, Z. Zheng, and A. A. Chien, When is multi-version checkpointing needed, The 3rd Workshop for Fault-tolerance at Extreme Scale (FTXS), ACM Press, 2013.
- R. Lucas et al., Top Ten Exascale Challenges, DOE ASCAC Subcommittee Report, U.S. Department of Energy, Office of Science, 1-86, 2014.
- A. Moody, G. Bronevetsky, K. Mohror, and B. R. de Supinski, Design, modeling, and evaluation of a scalable multi-level checkpointing system, The 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (2010), 1-11.
- T. O'Gorman, The effect of cosmic rays on the soft error rate of a DRAM at ground level, IEEE Transactions on Electron Devices 41 (1994), no. 4, 553-557. https://doi.org/10.1109/16.278509
- T. Ozaki, T. Dohi, H. Okamura, and N. Kaio, Distribution-free checkpoint placement algorithms based on min-max principle, IEEE Transactions on Dependable and Secure Computing 3 (2006), no. 2, 130-140. https://doi.org/10.1109/TDSC.2006.22
- S. Toueg and O. Babaoglu, On the optimum checkpoint selection problem, SIAM J. Comput. 13 (1984), no. 3, 630-649. https://doi.org/10.1137/0213039
- J. W. Young, A first order approximation to the optimum checkpoint interval, Comm. of the ACM, 17 (1974), no. 9, 530-531. https://doi.org/10.1145/361147.361115
- J. F. Ziegler, H. W. Curtis, H. P. Muhlfeld, C. J. Montrose, and B. Chin, IBM experiments in soft fails in computer electronics, IBM J. Res. Dev. 40 (1996), no. 1, 3-18. https://doi.org/10.1147/rd.401.0003
- J. F. Ziegler, H. P. Muhlfeld, C. J. Montrose, H. W. Curtis, T. O'Gorman, and J. Ross, Accelerated testing for cosmic soft-error rate, IBM J. Res. Dev. 40 (1996), no. 1, 51-72. https://doi.org/10.1147/rd.401.0051
- J. F. Ziegler, M. Nelson, J. Shell, R. Peterson, C. Gelderloos, H. P. Muhlfeld, and C. J. Montrose, Cosmic ray soft error rates of 16-Mb DRAM memory chips, IEEE Journal of Solid-State Circuits 33 (1998), no. 2, 246-252. https://doi.org/10.1109/4.658626