
Re: [Condor-users] Checkpointing on Windows pool PCs: I need little help...




-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Rob
Sent: 24 September 2011 15:12
To: condor-users
Subject: Re: [Condor-users] Checkpointing on Windows pool PCs: I need little help...

On Tue, 20 Sep 2011 09:41:31 "Smith, Ian" wrote:
>
> I've experimented quite a bit with this here and I wrote up a little guide for
> Matlab based jobs (which many Condor users run here). It's at:
>
> http://www.liv.ac.uk/csd/escience/condor/checkpoint.htm

Yes, I know that link and it is very useful!
Thank you for making this information available online.


Concerning the end-of-day shutdown:
I tried it, but there is no way I can catch that signal with the computers
in the library here. Apparently the Windows systems are already too far
down for the checkpoint procedure to complete successfully.

The only alternative seems to be something like a "condor_vacate -all" or
"condor_vacate_job -all" some 15 minutes before the official shutdown time
of the PCs....but I have not yet played with this option.
Any recommendations or thoughts on this?

Thank you.
Rob.

I think that's probably the best idea - run "condor_vacate_job -all". 
For long running jobs it may be worth doing this more frequently to guard against other 
unexpected shutdowns. 
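
As a rough illustration of the timing, here is a minimal Python sketch that
works out when the next vacate should happen and then calls
"condor_vacate_job -all". The 22:00 shutdown time used in the example and the
helper names (next_vacate_time, vacate_all_jobs) are assumptions made up for
this sketch, not anything Condor provides, and I have not tested it on a pool.

```python
import subprocess
from datetime import datetime, time, timedelta


def next_vacate_time(now, shutdown, lead=timedelta(minutes=15)):
    """Return the next datetime that falls `lead` before the daily shutdown.

    `shutdown` is a datetime.time; the real shutdown time of the pool PCs
    is site-specific and must be supplied by the caller.
    """
    target = datetime.combine(now.date(), shutdown) - lead
    if target <= now:
        # Today's slot has already passed, so use tomorrow's.
        target += timedelta(days=1)
    return target


def vacate_all_jobs():
    """Run "condor_vacate_job -all" and return its exit code.

    Assumes this runs on the submit host with condor_vacate_job on the PATH.
    """
    return subprocess.run(["condor_vacate_job", "-all"], check=False).returncode
```

A wrapper could sleep until next_vacate_time(datetime.now(), time(22, 0)) and
then call vacate_all_jobs(), or the second function could simply be run from
whatever scheduler is available on the submit host. Calling it at a shorter
fixed interval would cover the "more frequently" case as well.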

It would be nice if Condor had some periodic vacate function similar to PERIODIC_REMOVE
but AFAIK there isn't one. A workaround was suggested earlier in reply to a post
of mine on this list but I never managed to get around to looking at it properly.
See:

https://lists.cs.wisc.edu/archive/condor-users/2010-June/msg00169.shtml

regards,

-ian.