
Re: [Condor-users] Submit host has stopped working on production system



On 6/24/05, Dr Ian C. Smith <i.c.smith@xxxxxxxxxxxxxxx> wrote:
> Dear All,
> 
> I've been trying to make a small config change to
> our condor submit host and it looks as though it
> has brought the whole pool down and killed off all
> of the jobs running on it after it was working fine
> for months.

Config changes are worth testing on a single machine before an automatic rollout...

> After the condor_master and condor_schedd
> were restarted the schedd seems to have gone
> berserk and is taking > 98 % of the CPU. If I
> try to run condor_q it just freezes. condor_status
> is OK.

Given your next mail, I assume this was just the schedd dealing with a
whole bunch of jobs starting up.

<snip>

> Important aside: how can I stop/start the daemons on the submit host
> without killing all the jobs in the pool (some of which may have
> been running for days or weeks).

given ...
> 6/24 11:51:54 ** $CondorVersion: 6.6.7 Oct 11 2004 $
short answer - you can't.

If the schedd dies without a job lease, that's it: all jobs launched
from it that are currently in flight are going to die. I doubt they even
checkpoint properly (but I am not certain of this).
Since job leasing is only available in the 6.7 development series,
you're SOL...

Even with job leasing, if you want a job to survive the schedd going
down, you have to kill the schedd with extreme prejudice (kill) rather
than shut it down nicely. I'm not sure whether the same goes for the
shadow, but it's a moot point given your version.
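
To make the lease idea concrete, here is a toy sketch in Python. It is not Condor code: the class, the field names and the 1200-second lease are all made up for illustration, and it only models the rough rule that a leased job keeps running for a while after losing its schedd, whereas an unleased job does not.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunningJob:
    """Toy model of a job as seen from the execute machine (hypothetical names)."""
    lease_seconds: Optional[int]  # None models a pool without job leasing (e.g. 6.6.x)
    last_contact: float = 0.0     # time the submit side was last heard from

    def survives(self, schedd_back_at: float) -> bool:
        """Return True if the job is still alive when the schedd reconnects."""
        if self.lease_seconds is None:
            # No lease: losing the schedd means losing the job.
            return False
        # With a lease, the job keeps running until the lease runs out.
        return schedd_back_at - self.last_contact <= self.lease_seconds


no_lease = RunningJob(lease_seconds=None)
leased = RunningJob(lease_seconds=1200)  # illustrative lease length, an assumption

print(no_lease.survives(schedd_back_at=60))    # False
print(leased.survives(schedd_back_at=60))      # True
print(leased.survives(schedd_back_at=3600))    # False
```

In other words, a lease only buys you a window: the schedd has to come back before the lease expires, or the job is lost anyway.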

Matt