Combining replication and checkpointing redundancies for reducing resiliency overhead

  • Motallebi, Hassan (Department of Electrical and Computer Engineering, Graduate University of Advanced Technology)
  • Received: 2018.12.03
  • Reviewed: 2019.08.14
  • Published: 2020.06.08

Abstract

We herein propose a heuristic redundancy selection algorithm that combines resubmission, replication, and checkpointing redundancies to reduce the resiliency overhead in fault-tolerant workflow scheduling. The appropriate combination of these redundancies for workflow tasks is obtained in two consecutive phases. First, to compute the replication vector (number of task replicas), we apportion the set of provisioned resources among concurrently executing tasks according to their needs. Subsequently, we obtain the optimal checkpointing interval for each task as a function of the number of replicas and characteristics of tasks and computational environment. We formulate the problem of obtaining the optimal checkpointing interval for replicated tasks in situations where checkpoint files can be exchanged among computational resources. The results of our simulation experiments, on both randomly generated workflow graphs and real-world applications, demonstrated that both the proposed replication vector computation algorithm and the proposed checkpointing scheme reduced the resiliency overhead.
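
The abstract does not give the formulas used in either phase, so the following Python sketch is only an illustration of the two-phase idea, not the paper's algorithm. For phase one it assumes a simple largest-remainder apportionment of resources in proportion to a hypothetical per-task "need" score. For phase two it substitutes Young's classic first-order approximation of the checkpoint interval, sqrt(2 · C · MTBF), which ignores replication and checkpoint exchange. The names `apportion_replicas`, `young_interval`, `needs`, `checkpoint_cost`, and `mtbf` are assumptions introduced here.

```python
import math


def apportion_replicas(needs: dict[str, float], resources: int) -> dict[str, int]:
    """Split `resources` among concurrent tasks in proportion to `needs`.

    Hypothetical scheme: every task gets one replica first, then the spare
    resources are shared by the largest-remainder method.
    """
    if resources < len(needs):
        raise ValueError("need at least one resource per concurrent task")
    replicas = {task: 1 for task in needs}
    spare = resources - len(needs)
    total_need = sum(needs.values())
    if spare == 0 or total_need <= 0:
        return replicas
    # Fractional share of the spare resources for each task.
    quotas = {task: spare * need / total_need for task, need in needs.items()}
    for task, quota in quotas.items():
        replicas[task] += math.floor(quota)
    # Hand out what is left to the tasks with the largest fractional parts.
    leftover = resources - sum(replicas.values())
    by_remainder = sorted(
        quotas, key=lambda t: quotas[t] - math.floor(quotas[t]), reverse=True
    )
    for task in by_remainder[:leftover]:
        replicas[task] += 1
    return replicas


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order approximation of the checkpoint interval.

    Baseline only: it does not model replicas or exchanged checkpoint files.
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


if __name__ == "__main__":
    vector = apportion_replicas({"t1": 3.0, "t2": 1.0}, resources=6)
    print(vector)
    print(young_interval(checkpoint_cost=2.0, mtbf=100.0))
```

In this sketch the replication vector is computed first and the interval second, mirroring the order of the two phases; a faithful implementation would make the interval depend on the replica count, as the abstract states.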
