Re: [Condor-users] checkpointing in windows

If your application is able to generate its own checkpoint data (a restart
file, for example), is there a way to 'simulate' checkpointing under Windows?
That is, can Condor periodically copy selected application-generated files
and retrieve them when execution restarts on a new machine? Perhaps via a flag
in the job ClassAd to say RESTART_FILES = restart1.ext,restart2.ext

-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Matt Hope
Sent: 24 February 2006 09:16
To: Condor-Users Mail List
Subject: Re: [Condor-users] checkpointing in windows

On 2/23/06, Kerbel, Kit <kkerbel@xxxxxxxxxxxxxxx> wrote:
> Does anyone know a timeline for when checkpointing might be possible
> in Windows...as it is a bit useless to me for my purposes as is...the
> cluster could work for 2 weeks straight, crash and then lose all work
> that was done.
> Any ideas are more than welcome.

I am not a member of the Condor development team, but:

Given how complex this is, don't expect it any time soon (how I would love to
be proved wrong on this!). Indeed, I would be tempted to say that unless you
are capable of supplying serious amounts of funding (or have some body which
is willing to do it), ice skating to work will be the devil's way of
avoiding fuel price rises before Windows gets a proper standard universe.

This applies only to standard universe style checkpointing, of course.
You can do your own in response to the WM_CLOSE event. This is rather
trickier to set up (a lot more config must be set correctly for it to actually
work when you try) but is perfectly possible. You just need to be able to
save your state somehow and restart from said saved state*.
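
As a rough illustration of "save your state and restart from it", here is a
minimal sketch in Python (chosen for brevity; it is not the .NET setup
discussed in this thread). The file name restart.pkl, the shape of the state
dict, and the functions load_state, save_state and run are all made up for the
example. It checkpoints periodically rather than in response to WM_CLOSE, and
it says nothing about how Condor would move the file between machines.

```python
import os
import pickle
import tempfile

RESTART_FILE = "restart.pkl"  # hypothetical restart file name


def load_state(path=RESTART_FILE):
    """Return the saved state, or a fresh one if no restart file exists."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"next_step": 0, "total": 0}


def save_state(state, path=RESTART_FILE):
    """Write to a temp file and rename it over the restart file, so being
    killed mid-write cannot leave a half-written restart file behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)


def run(total_steps, checkpoint_every=100, path=RESTART_FILE):
    """Toy workload: sum the step numbers, resuming from the last checkpoint."""
    state = load_state(path)
    for step in range(state["next_step"], total_steps):
        state["total"] += step           # the "real work" would go here
        state["next_step"] = step + 1    # record where to resume from
        if state["next_step"] % checkpoint_every == 0:
            save_state(state, path)
    save_state(state, path)
    return state["total"]
```

If the process is killed partway through, the next call to run() picks up from
the last saved next_step instead of from zero, so at most checkpoint_every
steps of work are repeated.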

I am looking to see if binary serialization in .NET 2.0 is totally sorted.
The idea is to make *everything* in your app [Serializable], then just write
your entire object graph, event handlers, anonymous delegates and all, to a
file when you are at a well-defined point where you can 'recreate' your
position in the call stack.
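
To show what "write your entire object graph" means in practice, here is a
small sketch using Python's pickle module as a stand-in for .NET binary
serialization (the two are not the same API, and the graph below is invented
for the example). The point is that shared references and cycles survive the
round trip, which is what lets you reload the whole state in one go.

```python
import pickle


def round_trip(obj):
    """Serialize an object graph to bytes and load it back."""
    return pickle.loads(pickle.dumps(obj))


# Hypothetical application state: a parent/child graph containing a cycle
# (child -> parent -> child) and a second reference to the same child.
root = {"name": "root", "children": []}
child = {"name": "child", "parent": root}
root["children"].append(child)
root["current"] = child

restored = round_trip(root)

# The restored graph is a separate copy of the original...
assert restored is not root
# ...but its internal structure is intact: the cycle and the shared
# reference both point at the restored objects, not at duplicates.
assert restored["children"][0]["parent"] is restored
assert restored["current"] is restored["children"][0]
```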


* Oh how that glosses over it but there you go :)

