Аннотация:Independent checkpointing is a simple technique for providing fault tolerance in distributed systems. However, it can suffer from the domino effect, which causes the rollback of one process to potentially propagate to others. In this paper we present an adaptive checkpointing algorithm to practically eliminate rollback propagation for independent checkpointing. Our algorithm is based on proofs of the conditions necessary and sufficient for a checkpoint to belong to some consistent global checkpoint, previously an open question. We characterize these conditions with a generalization of Lamport's happened-before relation called a zigzag path. Our algorithm tracks zigzag paths on-line and checkpoints when certain paths are detected. Experiments on an iPSC/860 hypercube show that our algorithm reduces the average rollback required to recover from any fault to less than one checkpoint interval per process, and checkpoints only 4% more often than traditional periodic checkpointing algorithms. We thus eliminate rollback propagation without the runtime overhead of coordinated checkpoints or other schemes that attempt to reduce rollback propagation.< >