Article Information

  • Title: McrEngine: A Scalable Checkpointing System Using Data-Aware Aggregation and Compression
  • Authors: Tanzima Zerin Islam; Kathryn Mohror; Saurabh Bagchi
  • Journal: Scientific Programming
  • Print ISSN: 1058-9244
  • Year of publication: 2013
  • Volume: 21
  • Issue: 3-4
  • Pages: 149-163
  • DOI: 10.1155/2013/341672
  • Publisher: Hindawi Publishing Corporation
  • Abstract:

    High-performance computing (HPC) systems use checkpoint-restart to tolerate failures. Typically, applications store their states in checkpoints on a parallel file system (PFS). As applications scale up, checkpoint-restart incurs high overheads due to contention for PFS resources. The high overheads force large-scale applications to reduce checkpoint frequency, which means more compute time is lost in the event of failure. We alleviate this problem through a scalable checkpoint-restart system, mcrEngine. McrEngine aggregates checkpoints from multiple application processes with knowledge of the data semantics available through widely used I/O libraries, e.g., HDF5 and netCDF, and compresses them. Our novel scheme improves compressibility of checkpoints by up to 115% over simple concatenation and compression. Our evaluation with large-scale application checkpoints shows that mcrEngine reduces checkpointing overhead by up to 87% and restart overhead by up to 62% over a baseline with no aggregation or compression.

  • Keywords: Data-aware; checkpoint restart; distributed applications; distributed systems; fault tolerance; aggregation; bottleneck; multiple-processor systems; application-level checkpointing; rollback recovery; system reliability; distributed programming; fault-tolerant computing; software reliability; system recovery