
Re: [Condor-users] Checkpointing in vanilla



On 1/19/06, Thomas Materna <materna@xxxxxxxxxxxxx> wrote:
> There is no checkpointing in vanilla universe. That is precisely the main
> difference between the vanilla universe and the standard universe!

That is not, strictly speaking, true.

What the vanilla universe lacks by default is the automatic
checkpointing you get by relinking with condor_compile.
Even in the vanilla universe you can make use of the Condor-supplied
checkpointing code (see previous posts or the manual for this).

This is of course only possible on supported platforms (i.e. the more
common *nix variants).

However, even if this is unavailable to you (say, for Windows users), you
can still checkpoint. You just have to do it all yourself (and jump
through some hoops on Windows).

Specifically, you can checkpoint in the vanilla universe on Windows by
responding to the WM_CLOSE message (by "respond" I mean exit the process
yourself after storing any necessary state). The time you have to
respond is determined by the KILL expression. If this evaluates to
True before you exit, Condor views you as not having checkpointed. It
will therefore not bother to pull back the contents of the root
directory.
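
The "store state, then exit" half of this can be sketched roughly as
follows. This is only an illustration, not Condor's own mechanism: it is
written in Python and uses SIGTERM as a stand-in for WM_CLOSE (catching
WM_CLOSE itself needs a window message loop, which is not shown). The
file name, the state layout and all function names are made up for the
example.

```python
import json
import os
import signal

stop_requested = False  # set by the handler, polled by the work loop


def request_stop(signum, frame):
    """Only record the request; the work loop saves state at a safe point."""
    global stop_requested
    stop_requested = True


def save_state(path, state):
    """Write to a temporary file and rename it into place, so that being
    killed mid-write never leaves a half-written checkpoint behind."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def work(path, total_steps):
    """Run until finished or asked to stop. Returns True if the work
    completed, False if state was saved because a stop was requested."""
    state = {"step": 0, "acc": 0}
    while state["step"] < total_steps:
        if stop_requested:
            save_state(path, state)
            return False
        state["acc"] += state["step"]  # placeholder for the real computation
        state["step"] += 1
    return True


def main():
    # Assumption: the shutdown request arrives as SIGTERM (stand-in for WM_CLOSE).
    signal.signal(signal.SIGTERM, request_stop)
    work("checkpoint.json", 10_000_000)  # hypothetical file name and job size
    raise SystemExit(0)  # exit by ourselves, ideally well before KILL becomes True
```

The handler only sets a flag so that the state is written from the main
loop at a point where it is consistent. How often the loop checks the
flag bounds how quickly you can exit, which matters given the KILL
deadline.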

If you can checkpoint without any need for condor to manage the stored
state for you then you can ignore all the above.

This really is a (very) short intro to the process. It takes a lot of
tweaking to get right, but it *is* possible.

> Suspend and resume do exactly what they say. They do not checkpoint, that
> means they do not save the status of the job in a file so that it can be
> continued elsewhere. Suspend and resume only happen on the same machine.

Indeed. You can view these as just 'pausing' the process by starving
it of all CPU (note that the memory footprint remains, though it is
likely to get pushed to the pagefile).

> Condor will not resume the vanilla job where it left, but it can restart it
> automatically from the beginning.

If checkpointing has occurred, the process is still started as before
(exactly the same args etc.), but your code needs to spot that a
checkpoint has happened (normally by the presence of some indicator
file that was written during the checkpoint stage).
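
A minimal sketch of that restart check, in Python. The indicator file
name, the JSON state layout and the stop_at parameter (which just
simulates an eviction) are all assumptions made for the example, not
anything Condor provides.

```python
import json
import os


def load_or_init(path):
    """Return the saved state if the indicator file exists, else a fresh one."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "acc": 0}


def save_state(path, state):
    """Write via a temporary file so a partial write is never mistaken
    for a valid checkpoint on the next start."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run(path, total_steps, stop_at=None):
    """Sum 0..total_steps-1, resuming from a checkpoint if one is present.

    stop_at simulates being evicted at that step (illustration only).
    Returns the result, or None if the run was interrupted.
    """
    state = load_or_init(path)
    while state["step"] < total_steps:
        if stop_at is not None and state["step"] == stop_at:
            save_state(path, state)
            return None
        state["acc"] += state["step"]
        state["step"] += 1
    if os.path.exists(path):
        os.remove(path)  # finished: a later rerun should start from scratch
    return state["acc"]
```

Calling run("checkpoint.json", 10, stop_at=4) and then
run("checkpoint.json", 10) with the same arguments Condor would reuse
gives the same answer as an uninterrupted run; the second call picks up
at step 4 because the indicator file is there.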

> If you want checkpointing, you have to abandon vanilla universe, otherwise,
> I am afraid you are going to have to live with it.

If you want checkpointing with very little effort you *will* have to
abandon the vanilla universe. If you are willing to go through
the learning curve, code changes and setup tweaks needed to get it
working, then it works and is a good idea, since Condor's behaviour is
better suited to checkpointable jobs.*

Matt

* To be honest, this is true of most distributed systems with low
hardware reliability provisions.