
Re: [Condor-users] Checkpointing on Windows pool PCs: I need little help...



I've only just had a look at this, as I don't seem to get much time to
read condor-users these days. Some suggestions are inline below.

> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-
> bounces@xxxxxxxxxxx] On Behalf Of Rob
> Sent: 15 September 2011 03:28
> To: condor-users
> Subject: [Condor-users] Checkpointing on Windows pool PCs: I need little help...
> 
> Hello,
> 
> I have used section 6.2.8 of the Condor manual and other references to get some
> kind of artificial checkpointing to work on Windows pool PCs, using the
> "SetConsoleCtrlHandler()" to catch the CTRL_CLOSE_EVENT, which allows me to
> save relevant data right before Condor throws the job from the Windows system.
> 
> In order to use the checkpointed data file, my program also checks at the beginning
> whether the checkpoint file exists, and if so, it initializes itself with that data, so that
> it continues where it has left off at the previous eviction.
> 
> All this works great!
> And I can see the checkpoint files appear in the temporary spool directories of the
> master PC.
> 

I've experimented quite a bit with this here and I wrote up a little guide for
Matlab based jobs (which many Condor users run here). It's at:

http://www.liv.ac.uk/csd/escience/condor/checkpoint.htm

I've had one user who has employed this very successfully to run jobs of around
a week's duration. I think the people at the University of Manchester use a
similar procedure.
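
As a rough illustration of the restart logic described above (check for a
checkpoint file at start-up and carry on from it if it exists), here is a
minimal Python sketch. It is not the code from the guide or from Rob's program:
the file name, the JSON state layout and the toy "sum the step numbers" workload
are all assumptions made for the example, and it saves after every step rather
than from a SetConsoleCtrlHandler() callback.

```python
import json
import os
import tempfile

CHECKPOINT_FILE = "checkpoint.json"  # hypothetical file name


def load_state(path=CHECKPOINT_FILE):
    """Return the saved state if a checkpoint exists, otherwise a fresh state."""
    if os.path.exists(path):
        with open(path) as fh:
            return json.load(fh)
    return {"next_step": 0, "total": 0}


def save_state(state, path=CHECKPOINT_FILE):
    """Write to a temporary file, then rename it over the checkpoint, so an
    eviction in the middle of a write cannot leave a truncated file behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump(state, fh)
    os.replace(tmp, path)


def run(n_steps, path=CHECKPOINT_FILE, stop_after=None):
    """Toy workload: sum the integers 0..n_steps-1, saving after each step.

    stop_after simulates an eviction after that many steps in this run;
    in that case the function returns None instead of the final total.
    """
    state = load_state(path)
    done_this_run = 0
    while state["next_step"] < n_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return None  # simulated eviction
        state["total"] += state["next_step"]
        state["next_step"] += 1
        save_state(state, path)
        done_this_run += 1
    return state["total"]
```

Running run(10, stop_after=4) and then run(10) in the same directory gives the
same total as one uninterrupted run, because the second call picks up from the
saved step. In a real job the save would more likely be triggered from the
console control handler, as in the original post.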

> 
> Now, there is another issue, that I'm unsure about:
> 
> All the Windows pool PCs are public computers in a university library.
> By the end of the day, after the library has closed its doors, a library IT person shuts
> down all the PCs (the timing is not fixed; sometimes he does it before dinner,
> sometimes afterwards.....).
> Especially at library closing time, most library PCs are not used and almost all are
> running Condor jobs. Upon shutdown, I expect Condor to just being squashed and
> hence no time for checkpointing. Is that right?
> 
> My question here is: would it work to also catch the "CTRL_SHUTDOWN_EVENT" in
> my program? Or is it already too late by then? (With "too late" I mean: at that stage
> the network interface and Condor are already dead!?!).
> 

This sounds very familiar. It really depends on how the staff shut down the PCs.
If they shut down using the Start menu then my feeling is that your application
won't get this signal, but the Condor processes will shut down cleanly, so at
least the central manager can reschedule the job elsewhere (even if that is the
following day). Even if the application did get this signal, I doubt that there
would be time to transfer the checkpoint file(s) to the submitter before the
Condor processes were killed and communication with the submitter was lost.
That would happen before the network interface went down.

Unfortunately, I've found here that some of our PC users are less well behaved
and just hit the front panel power button to shut down, or even unplug the PC
from the mains (so that they can plug in their mobile phone chargers, presumably!).
Of course there is no way you would be able to do a checkpoint in these circumstances,
but what makes things worse is that it can take the central manager quite a while
to work out that the PC is effectively dead and reschedule the job (I've seen this
take twelve hours or more), which is obviously very bad for throughput. There are
some config variables that can be used to alter the timeouts but I've yet to play
with them.

I wrote up quite a detailed account of my dealings with this kind of thing (particularly
for power saving PCs). You can find it here:

http://www.liv.ac.uk/csd/escience/condor/cardiff_condor.pdf


regards,

-ian.

PS One way around your problem might be to evict all jobs before the library closes
   (use condor_vacate_job) so that you capture the checkpoints, then put them on hold
   for a while so that they aren't running when the machines are shut down. That would
   avoid the timeout problem if the staff do just power them off.
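
A minimal sketch of how that PS might be scripted from the submit machine, to
be run from a scheduled task shortly before closing time. The constraint
string, the pause length and the function names are assumptions for
illustration; only condor_vacate_job, condor_hold and condor_release and their
-constraint option are real Condor commands. The pause is there because vacating
is not instantaneous, and how long the checkpoint transfer needs will depend on
the job.

```python
import subprocess
import time


def pre_shutdown_commands(constraint):
    """Commands to run before closing: vacate first so the checkpoint files
    are transferred back, then hold so the jobs are not restarted."""
    return [
        ["condor_vacate_job", "-constraint", constraint],
        ["condor_hold", "-constraint", constraint],
    ]


def post_startup_commands(constraint):
    """Command to run the next morning to let the held jobs run again."""
    return [["condor_release", "-constraint", constraint]]


def run_all(commands, pause_seconds=0, dry_run=True):
    """Run the commands in order, pausing between them.

    With dry_run=True the commands are only printed, which is a safe way
    to check them before pointing the script at a live queue.
    """
    for index, cmd in enumerate(commands):
        if index > 0 and pause_seconds and not dry_run:
            time.sleep(pause_seconds)  # give the vacate time to complete
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical constraint: all jobs belonging to user "rob".
    mine = 'Owner == "rob"'
    run_all(pre_shutdown_commands(mine), pause_seconds=300, dry_run=True)
```

Switching dry_run to False makes the script actually call the Condor tools, so
it is worth trying it on a single test job first.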