Combining replication and checkpointing redundancies for reducing resiliency overhead

Motallebi, Hassan

doi:10.4218/etrij.2018-0684

ETRI Journal

Volume 42, Issue 3 / Pages 388-398 / 2020 / 1225-6463 (pISSN) / 2233-7326 (eISSN)

Electronics and Telecommunications Research Institute (ETRI)

Motallebi, Hassan (Department of Electrical and Computer Engineering, Graduate University of Advanced Technology)

Received: 2018.12.03
Reviewed: 2019.08.14
Published: 2020.06.08

https://doi.org/10.4218/etrij.2018-0684

Abstract

We propose a heuristic redundancy selection algorithm that combines resubmission, replication, and checkpointing to reduce the resiliency overhead in fault-tolerant workflow scheduling. The appropriate combination of these redundancies for the workflow tasks is obtained in two consecutive phases. First, to compute the replication vector (the number of replicas of each task), we apportion the set of provisioned resources among concurrently executing tasks according to their needs. Second, we obtain the optimal checkpointing interval for each task as a function of its number of replicas and the characteristics of the tasks and the computational environment. We formulate the problem of obtaining the optimal checkpointing interval for replicated tasks in situations where checkpoint files can be exchanged among computational resources. The results of our simulation experiments, on both randomly generated workflow graphs and real-world applications, demonstrate that both the proposed replication vector computation algorithm and the proposed checkpointing scheme reduce the resiliency overhead.
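The two phases can be illustrated with a minimal Python sketch. It is not the formulation proposed in the paper, which the abstract does not spell out. As stand-ins, it assumes largest-remainder apportionment for the replication vector, Young's classic first-order approximation (interval = sqrt(2 × checkpoint cost × MTBF)) for the checkpointing interval, and independent exponentially distributed failures with progress lost only when every replica has failed. All function names, the numeric "need" weights, and the sample values are hypothetical.

```python
from __future__ import annotations

import math


def apportion_replicas(needs: list[float], resources: int) -> list[int]:
    """Phase 1 (assumed scheme): split `resources` among concurrent tasks in
    proportion to their need weights, giving every task at least one replica.
    Uses largest-remainder rounding so the counts sum exactly to `resources`."""
    n = len(needs)
    if resources < n:
        raise ValueError("need at least one resource per task")
    if any(w <= 0 for w in needs):
        raise ValueError("need weights must be positive")
    spare = resources - n                      # resources left after one per task
    total = sum(needs)
    quotas = [spare * w / total for w in needs]
    replicas = [1 + math.floor(q) for q in quotas]
    leftover = resources - sum(replicas)
    # Hand the remaining resources to the largest fractional remainders.
    by_remainder = sorted(range(n), key=lambda i: quotas[i] - math.floor(quotas[i]),
                          reverse=True)
    for i in by_remainder[:leftover]:
        replicas[i] += 1
    return replicas


def effective_mtbf(mtbf: float, replicas: int) -> float:
    """Assumption: replicas fail independently with exponential lifetimes and
    are not restarted, so the expected time until all have failed is
    MTBF * H_r, where H_r is the r-th harmonic number."""
    if mtbf <= 0 or replicas < 1:
        raise ValueError("mtbf must be positive and replicas >= 1")
    return mtbf * sum(1.0 / k for k in range(1, replicas + 1))


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Phase 2 (stand-in): Young's first-order optimum checkpoint interval."""
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


if __name__ == "__main__":
    needs = [1.0, 1.0, 2.0]                    # hypothetical per-task weights
    vector = apportion_replicas(needs, resources=7)
    for task, r in enumerate(vector):
        tau = young_interval(checkpoint_cost=50.0,
                             mtbf=effective_mtbf(10_000.0, r))
        print(f"task {task}: replicas={r}, checkpoint interval={tau:.1f} s")
```

Under these assumptions, a task with more replicas has a longer effective MTBF and therefore checkpoints less often, which is the qualitative trade-off the abstract describes between replication and checkpointing overhead.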
