
Re: [Condor-users] Checkpointing on Windows pool PCs: I need little help...



Currently, for vanilla universe jobs, the checkpoint files are moved off the execute machine only when the job is vacated or evicted.

In order to get this to work, the submit file for the job needs to have

    when_to_transfer_output = ON_EXIT_OR_EVICT

The submit machine must also still be running at that point (i.e., the librarian needs to shut that machine down after the execute machines).
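
For illustration, a minimal submit file using that setting might look roughly like the sketch below. The executable and log file names are hypothetical placeholders, and should_transfer_files = YES is my assumption that Condor's file transfer mechanism is in use (i.e. no shared file system between the submit and execute machines).

    # Sketch only: file names below are placeholders, not from the original post.
    universe                = vanilla
    executable              = my_job.exe
    # Assumption: Condor file transfer is used rather than a shared file system.
    should_transfer_files   = YES
    # Transfer output (including any checkpoint files the job wrote) on exit or eviction.
    when_to_transfer_output = ON_EXIT_OR_EVICT
    log                     = my_job.log
    queue

With this, files the job has written in its working directory on the execute machine are transferred back when the job is evicted, provided the submit machine is still reachable at that moment.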

That's about the best you can hope for with the current version of Condor.


On 9/16/2011 9:52 AM, Rob wrote:
Hi,

Here are my observations on checkpointing with Windows:

A running program does indeed get the CTRL_SHUTDOWN_EVENT when Windows shuts down (and there is enough time to create checkpoint files on the local machine), but by then Condor and/or the network are apparently already in a "dead-enough" state, so communicating with the condor master is no longer possible.
Upon boot-up, the Windows computer cleans up whatever is left over from previous jobs, so the job's history/checkpoint data is lost.

The only remedy here is to do regular checkpointing.

But how can I tell Condor to transfer the checkpoint files from the pool PC to the master, without evicting the job?

Thanks,

Rob
