Re: [Condor-users] Checkpointing on Windows pool PCs: I need little help...
- Date: Mon, 19 Sep 2011 15:49:24 -0500
- From: "John (TJ) Knoeller" <johnkn@xxxxxxxxxxx>
- Subject: Re: [Condor-users] Checkpointing on Windows pool PCs: I need little help...
Currently, for vanilla universe jobs, checkpoint files are moved off
the execute machine only when the job is vacated or evicted.
For this to work, the submit file for the job needs to have:
when_to_transfer_output = ON_EXIT_OR_EVICT
The submit machine must also still be running (i.e., the librarian needs
to shut that machine down after the execute machines).
That's about the best you can hope for with the current version of Condor.
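For illustration, a minimal submit file using that setting might look roughly like the sketch below. The executable and file names (my_job.exe, my_job.out, and so on) are made-up placeholders, not taken from Rob's setup, and it assumes the job writes its checkpoint files into its own scratch directory on the execute machine so that file transfer can pick them up.

```
# Sketch only: my_job.* names are hypothetical placeholders.
universe                = vanilla
executable              = my_job.exe

# Let Condor's file transfer mechanism move files for the job.
should_transfer_files   = YES

# Transfer output (including checkpoint files) back both on normal
# exit and when the job is vacated or evicted.
when_to_transfer_output = ON_EXIT_OR_EVICT

output                  = my_job.out
error                   = my_job.err
log                     = my_job.log
queue
```

Note that this only helps if Condor actually gets the chance to evict the job while the submit machine is still reachable; it does not transfer anything at periodic intervals while the job keeps running.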
On 9/16/2011 9:52 AM, Rob wrote:
Here are my observations on checkpointing with Windows:
A running program does indeed get the CTRL_SHUTDOWN_EVENT when Windows shuts down (and there is enough time to create checkpoint files on the local machine), but by then Condor and/or the network are apparently already in a "dead-enough" state that communication with the Condor master is no longer possible.
Upon boot, the Windows computer cleans up whatever is left over from previous jobs, so the job's history/checkpoint data is lost.
The only remedy here is to do regular checkpointing.
But how can I tell Condor to transfer the checkpoint files from the pool PC to the master, without evicting the job?