
Re: [Condor-users] rebooting a submission node

On Wed, 18 May 2005 09:31:38 +0200  matthias.m.roehm@xxxxxxxxxxxxxxxxxxx wrote:

> Why are vanilla jobs killed if the submit machine is down? I thought
> that vanilla jobs don't communicate (no remote system calls) with
> the submit machine during execution.

that's not entirely true.  vanilla jobs can use "chirp" to do remote
I/O back to the submit machine.  vanilla jobs also send periodic
updates back to the submit machine, and there are user-defined job
policies (like periodic_hold, etc.) that the condor_shadow on the
submit machine evaluates, possibly taking action at the remote
execution site.
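for example, here's a rough sketch of a vanilla submit file with one
of those policies (the executable name and the 24-hour limit are made
up for illustration):

```
# vanilla job with a user-defined policy; the shadow on the submit
# machine evaluates periodic_hold, so the submit machine has to be
# up for the policy to take effect
universe      = vanilla
executable    = my_job
output        = my_job.out
error         = my_job.err
log           = my_job.log

# put the job on hold if it has been running more than 24 hours
periodic_hold = (time() - JobCurrentStartDate) > (24 * 3600)

queue
```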

> The only critical time I can think of is when a job ends and the
> result should be transfered. But why kill a job which is running for
> days because the submit machine is only down for a few minutes?

that's the whole point of the job_lease_duration stuff.  if there's a
temporary network failure, the submit machine crashes, etc., then
condor can recover.  however, if you gracefully shut down the schedd
on the submit machine (SIGTERM, condor_off), condor assumes you're
shutting down the submit machine. ;) by default, when you reboot a
machine, your OS is going to send SIGTERM to all the pids, which
condor will interpret as a graceful shutdown.  since condor is trying
to be a good network citizen and clean up after itself, it tries to
evict all the jobs being served from that submit machine.  we have no
way of knowing if you're planning to bring the schedd back up in 5
minutes or 5 weeks, and people complain if jobs keep running once you
"shutdown condor".
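to get that recovery window, you set job_lease_duration in the submit
file (the 20-minute value here is just an example, pick whatever
outage you want to survive):

```
# keep the claim on the execute machine alive for 20 minutes if the
# submit machine goes away, so a restarted schedd can reconnect to
# the still-running job instead of losing it
job_lease_duration = 1200
```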

therefore, if you do NOT want the schedd to try to clean up after
itself, you want the remote jobs to keep running, and you want the
schedd to reconnect after it comes back up (which should be relatively
soon), you just need to do a fast-shutdown (SIGQUIT or condor_off
-fast) before you actually reboot the OS.  that tells the schedd
"don't cleanup your remote jobs, just kill off all the condor_shadow
processes and exit immediately", which is exactly the behavior you
want in this case.
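in other words, something like this on the submit machine before the
reboot (the actual reboot command will vary by OS):

```shell
# fast shutdown: the schedd kills its shadows and exits immediately
# *without* evicting the remote jobs
condor_off -fast -schedd

# now it's safe to reboot; when condor comes back up, the schedd will
# try to reconnect to the jobs that are still running, as long as
# each job's lease hasn't expired yet
shutdown -r now
```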