Journal: International Journal of Networking and Computing
Print ISSN: 2185-2847
Publication year: 2019
Volume: 9
Issue: 1
Pages: 2-27
Publisher: International Journal of Networking and Computing
Abstract: Large-scale platforms currently experience errors from two different sources, namely fail-stop
errors (which interrupt the execution) and silent errors (which strike unnoticed and corrupt
data). This work combines checkpointing and replication for the reliable execution of linear
workflows on platforms subject to these two error types. While checkpointing and replication
have been studied separately, their combination has not yet been investigated despite its promising
potential to minimize the execution time of linear workflows in error-prone environments.
Moreover, combined checkpointing and replication has not yet been studied in the presence of
both fail-stop and silent errors. The combination raises new problems: for each task, we have
to decide whether to checkpoint and/or replicate it to ensure its reliable execution. We provide
an optimal dynamic programming algorithm of quadratic complexity to solve both problems.
This dynamic programming algorithm has been validated through extensive simulations that
reveal the conditions under which checkpointing only, replication only, or the combination of both
techniques leads to improved performance.