Transparent Checkpoints of Closed Distributed Systems in Emulab

Anton Burtsev, Prashanth Radhakrishnan, Mike Hibler, and Jay Lepreau

Proceedings of the 4th ACM European Conference on Computer Systems (EuroSys) 2009.

DOI: 10.1145/1519065.1519084

areas
Networking, Operating Systems, Virtualization, Testbeds

abstract

Emulab is a testbed for networked and distributed systems experimentation. Two guiding principles of its design are realism and control of experimentation. There is an inherent tension between these goals, however, and in some aspects of the testbed's design, Emulab's implementers favored realism over control. Thus, Emulab provides wide-ranging control over an experiment's environment and initial conditions, but relatively little control over its execution: in particular, it lacks the ability to suspend, preempt, or replay the experiment.

We have extended Emulab with a new means of control over experiment execution: the ability to cleanly checkpoint the execution of the set of nodes and networks that comprise an experiment. Conventional checkpoint mechanisms can easily degrade the fidelity of experiment results as a consequence of checkpoint downtimes, overheads of background state saving, and unintended distributed checkpoint synchronization effects. In this paper we demonstrate a checkpointing technique that is transparent with respect to the execution of the system under test, almost completely concealing the underlying checkpoint activity.

Building on our checkpoint mechanism, we have implemented two powerful facilities for experiment execution control: the ability to preemptively swap out experiments without losing their run-time state, and the ability to time-travel through the run of a system.