Topic: Checkpoint-Recovery (2024)

Checkpoint-Recovery
Carnegie Mellon University
18-849b Dependable Embedded Systems
Spring 1999Author: John DeVale

Abstract

Traditional fault tolerance methods involving the checkpointing of systemstate and restoring it in the case of a system fault is one method availableto system designers whose goal is the creation of a robust, fault-tolerantsystem. While checkpoint-recovery may not be ideal for many embedded systemsdue to time or space constraints, it might be useable if the system is designedwith checkpoint-recovery in mind. Techniques such as memory exclusion anddesign for checkpointing may allow embedded system designers judicious useof checkpoint-rollback techniques. This approach to fault-tolerance protectsagainst the widely used fault model of transient hardware failures. Additionally,with the addition of some recovery feedback and algorithmic diversity, systemsthat tolerate software design faults could possibly be built.

Contents

Introduction
Key Concepts
Saving Executive State
Restoring Executive State
Failure Detection
Available Tools, Techniques, and Metrics
LibCKPT
LibFT
Other Tools
Metrics
Relationships to Other Topics
Conclusions
References
Further Reading

Introduction

Checkpoint-Recovery is a common technique for imbuing a program or systemwith fault tolerant qualities, and grew from the ideas used in systems whichemploy transaction processing [lyu95]. It allows systems to recover aftersome fault interrupts the system, and causes the task to fail, or be abortedin some way. While many systems employ the technique to minimize lost processingtime, it can be used more broadly to tolerate and recover from faults ina critical application or task.

The basic idea behind checkpoint-recover is the saving and restorationof system state. By saving the current state of the system periodicallyor before critical code sections, it provides the baseline information neededfor the restoration of lost state in the event of a system failure. Whilethe cost of checkpoint-recovery can be high, by using techniques like memoryexclusion, and by designing a system to have as small a critical state aspossible may minimize the cost of checkpointing enough to be useful in evencost sensitive embedded applications.

When a system is checkpointed, the state of the entire system is savedto non-volatile storage. The checkpointing mechanism takes a snapshot ofthe system state and stores the data on some non-volatile storage medium.Clearly, the cost of a checkpoint will vary with the amount of state requiredto be saved and the bandwidth available to the storage mechanism being usedto save the state.

In the event of a system failure, the internal state of the system canbe restored, and it can continue service from the point at which its statewas last saved. Typically this involves restarting the failed task or system,and providing some parameter indicating that there is state to be recovered.Depending on the task complexity, the amount of state, and the bandwidthto the storage device this process could take from a fraction of a secondto many seconds.

This technique provides protection against the transient fault model.Typically upon state restoration the system will continue processing inan identical manner as it did previously. This will tolerate any transientfault, however if the fault was caused by a design error, then the systemwill continue to fail and recover endlessly. In some cases, this may bethe most important type of fault to guard against, but not in every case.

Unfortunately, it has only limited utility in the presence of a softwaredesign fault. Consider for instance a system which performs control calculations,one of which is to divide a temperature reading into some value. Since thespecification requires the instrument to read out in degrees Kelvin (absolutetemperature), a temperature of 0 is not possible. In this case the programmer(realizing this) fails to check for zero prior to performing the divide.The system works well for a few months, but then the temperature gauge fails.The manufacturer realizes that a 0K temperature is not possible, and decidesthat the gauge should fail low, since a result of 0 is obviously indicativeof a failure. The system faults, and attempts to recover its state. Unfortunately,it reaches the divide instruction and faults, and continues to recover andfault until some human intervention occurs. The point here is not that thereshould be redundant temperature sensors, but that the most common formsof checkpoint and recovery are not effective against some classes of failures.

Key Concepts

The basic mechanism of checkpoint-recovery consists of three key ideas- the saving and restoration of executive state, and the detection of theneed to restore system state. Additionally, for more complex distributedembedded systems, the checkpoint-recovery mechanism can be used to migrateprocesses off individual nodes[Checkpoint andMigration of UNIX Processes in the Condor Distributed Processing SystemUniversity of Wisconsin-Madison Computer Sciences Technical Report #1346,April 1997.].

Saving executive state

A snapshot of the complete program state may be scheduled periodicallyduring program execution. Typically this is accomplished by pausing theoperation of the process whose state is to be saved, and copying the memorypages into non-volatile storage. While this can be accomplished by usingfreely available checkpoint-recovery libraries, it may be more efficientto build a customized mechanism into the system to be protected.

Between full snapshots, or even in place of all but the first completeshot, only that state which has changed may be saved. This is known as incrementalcheckpointing [plank96], and can be thought of in the same way as incrementalbackups of hard disks. The basic idea here is to minimize the cost of checkpointing,both in terms of the time required and the space (on non-volatile storage).

Not all program state may need to be saved. System designers may findit more efficient to build in mechanisms to regenerate state internally,based on a smaller set of saved state. Although this technique might bedifficult for some applications, it has the benefit of having the potentialto save both time and space during both the checkpoint and recovery operations.

A technique known as memory exclusion [plank96] allows a program to notifythe checkpoint algorithm which memory areas are state critical and whichare not. This technique is similar to that of rebuilding state discussedabove, in that it facilitates saving only the information most criticalto program state. The designer can exclude large working set arrays, stringconstants, and other similar memory areas from being checkpointed.

When these techniques are combined, the cost of checkpointing can bereduced by factors of 3-4[plank96]. Checkpointing, like any fault tolerantcomputing technique, does require additional resources. Whether or not itwill work well, is high dependant on both the target system design, andthe application. Typically those systems which must meet hard real-timedeadlines will have the most difficulty implementing any type of checkpoint-recoverysystem

Restoring executivestate

When a failure has occurred, the recovery mechanism restores system stateto the last checkpointed value. This is the fundamental idea in the toleranceof a fault within a system employing checkpoint-recovery. Ideally, the statewill be restored to a condition before the fault occurred within the system.After the state has been restored, the system can continue normal execution.

State is restored directly from the last complete snapshot, or reconstructedfrom the last snapshot and the incremental checkpoints. The concept is similarto that of a journaled file system, or even RCS(revision control system),in that only the changes to a file are recorded. Thus when the file is tobe loaded or restored, the original document is loaded, and then the specifiedchanges are made to it. In a similar fashion, when the state is restoredto a system which has undergone one or more incremental checkpoints, thelast full checkpoint is loaded, and then modified according to the statechanges indicated by the incremental checkpoint data.

If the root cause of the failure did not manifest until after a checkpoint,and that cause is part of the state or input data, the restored system islikely to fail again. In such a case the error in the system may be latentthrough several checkpoint cycles. When the it finally activates and causesa system failure, the recovery mechanism will restore the state (includingthe error!) and execution will begin again, most likely triggering the sameactivation and failure. Thus it is in the system designers best interestto ensure that any checkpoint-recovery based system is fail fast - meaningerrors are either tolerated, or case the system to fail immediately, withlittle or no incubation period.

Such recurring failures might be addressed through multi-level rollbacksand/or algorithmic diversity. Such a system would detect multiple failuresas described above, and recover state from checkpoint data previous to thelast recovery point. Additionally, when the system detects such multiplefailures it might switch to a different algorithm to perform its functionality,which may not be susceptible to the same failure modes. The system mightdegrade its performance by using a more robust, but less efficient algorithmin an attempt to provide base level functionality to get past the faultbefore switching back to the more efficient routines.

Failure Detection

Failure detection can be a tricky part of any fault tolerant design.Sometimes the line between an unexpected (but correct) result, and garbageout is difficult to discern. In traditional checkpoint-recovery failuredetection is somewhat simplistic. If the process or system terminates, thereis a failure. Additionally, some systems will recover state if they attempteda non-transactional operation that failed and returned. The discussion offailure detection, and especially how it impacts embedded systems is leftto the chapters on fault tolerance, reliability, dependability, and architecture.

Available Tools, Techniques,and Metrics

There is a wide range of freely available tools to aid in checkpoint-recovery.These tools vary from compiler assisted static and dynamic checkpoint placement,to user level checkpoint-recovery libraries.

Libckpt

The libckpt checkpointing library developed at University of TennesseeKnoxville[plank 95] provides a rich set of checkpoint-recovery tools includingmemory exclusion and incremental checkpointing. Although it is mainly intendedfor use on UNIX workstations, the movement toward the use of COTS operatingsystems in embedded systems may allow designers to use this free library.

Virtually transparent to the programmer, the only changes required arerenaming the main program module to ckpt_target() and if desired, callsto the memory exclusion routines and user-directed checkpointing. Runtimeconfiguration via an initialization (.rc) file allows the checkpoint mechanismto be configured without the need to be recompiled of linked. Options include:

Checkpointing (on or off)
Forked (concurrent) checkpointing
Incremental Checkpointing (on or off)
Max and min time between checkpoints
Max number of incrementals between full checkpoints
Checkpoint compression (LZW data compression of state data)

libFT

AT&T research labs offers libFT, a comprehensive checkpoint-recoverysystem which includes user level checkpointing, and watchdog processes.While it does not offer as advanced and comprehensive a feature set as libckpt,libFT provides solid user level checkpoint and recovery services to processes.The addition of a watchdog utility allows easy detection and recovery fromabnormal process termination, and is capable of monitoring and recoveringmany processes within a system.

Other Tools

Condor - University of Wisconsin project, mainly aimed at process migration and load balancing. New work on "Process Hijacking" allows migration of processes not intended to be migratory. http://www.cs.wisc.edu/condor
PMCKPT (Poor man's checkpoint) - Another free checkpointing system. http://warp.dcs.st-andrews.ac.uk/warp/systems/checkpoint/source.html
PORCH (portable checkpoint compiler)- Massachusetts Institute of Technology project consisting of a compiler which reads in a c-program, analyzes it, and re-emits the source with additional code to perform checkpoint-recovery. Project is in public beta at the time of this writing.
CATCH (compiler assisted techniques for checkpointing) University of Illinois at Urbana-Champaign technology which heuristically attempts to adaptively generate sparse potential checkpoint code to minimize the cost of checkpointing in terms of data size and time overhead[li 90].

Metrics

While there are no real benchmark style metrics for checkpointing systems,the performance of any system is measurable. The critical parameters arethe speed at which the state can be saved and read back, and how long normaloperation is suspended for either a checkpoint or a restore. The resultsare highly dependant on the application and platform, as well as how thecheckpoint is implemented. For instance a checkpoint-recovery version ofmatrix multiply can range from an average checkpoint time of 330 millisecondsper checkpoint using all the optimizations to 2.24 seconds [plank 95].

Relationshipsto other Topics

Fault Tolerant Computing

Checkpoint - Rollback is a technique which can be used to build fault tolerance into a computing system.
In its current form it very capably saves process state and can create a new process and restore old state to it in the case of a process failure.

SW fault tolerance

Related to SW fault tolerance by sharing a common goal.
Scope of the solutions are on a much different scale.
SW fault tolerance focuses more on making software not crash.
Traditional checkpointing focuses on recovering from the crash in a graceful manner while preserving computational state and critical data.

Conclusions

Checkpoint-recovery is a simple and effective method to add fault tolerance to a system, especially when the system is designed with it in mind.
May not be ideal for embedded applications, especially those with real-time requirements due to recovery overhead, but work on this is in progress[Xu96].
May not be effective for tolerating software design faults.
Has trouble when state includes more than just values in memory and registers - for instance disk accesses, and producer-consumer problems.

Annotated References

[Lyu 95]	M. R. Lyu, ed., Software Fault Tolerance Chichester, England:John Wiley and Sons, Inc., 1995. Good discussion on the theoretical problems and bounds of checkpointing.Includes discussion of the cost of checkpointing, and recovery given statisticalfailure rates.
[Russinovich95]	Russinovich, M., Segall, Z., Application-Transparent Checkpointing inMach 3.0/UX, Proceedings of the 28th Annual Hawaii International Conferenceon System Sciences, 1995 Describes a checkpointing implementation for Mach, and experimentallydetermines cost. Discusses how their system allows checkpointing at a systemlevel, transparent to the application.
[Xu96]	Xu, J., Randell, B., Roll-forward error recovery in embedded real-timesystems, Proceedings. 1996 International Conference on Parallel and DistributedSystems Real time embedded systems may not have enough time for traditional checkpointrecovery operations. Xu and Randell explore roll-forward recovery to easethe timing constraints on a system.
[Plank95]	Plank, J., Beck, M., Kingsley, G., Li, K., Libckpt: Transparent Checkpointingunder Unix, USENIX Winter Technical Conference, 1995 Describes the public domain checkpoint library libckpt. Discussion includesits wide array of features, and performance optimizing.
[Plank96]	Plank, J., Chen, Y., Li, K., Beck, M., Kingsley, G., Memory Exclusion:Optimizing the Performance of Checkpointing Systems, Technical Report UT-CS-96-335,University of Tennessee, August, 1996 Discusses the technique of memory exclusion by which a checkpointingsystem can exclude process state which is not critical from the checkpointoperation. Performance impact is quantified for several example applications.

Further Reading

Gray, Jim,. Reuter, Andreas,. Transaction processing: concepts and techniques, San Mateo, Calif. : Morgan Kaufmann Publishers,c1993. ISBN: 1558601902