Re: [Condor-users] administrator SIGQUIT vs condor_vacate SIGTERM
- Date: Wed, 19 Dec 2007 16:09:17 +0100
- From: Rob de Graaf <rob@xxxxxxxxxxxxxxxxxx>
- Subject: Re: [Condor-users] administrator SIGQUIT vs condor_vacate SIGTERM
Thanks for the reply,
> Your analysis here is wrong. The STARTER gets SIGTERM. It then kills
> Note that the SIGQUIT comes after the job has exited. This is part of
> the normal termination of the STARTER by the STARTD after the job has
> finished. The STARTER doesn't know why the job exited, only that it did.
I see. So regardless of the job's exit status, the starter only knows
that a job has exited, and the startd then terminates the starter?
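
To illustrate the general point for myself, here is a minimal Python sketch
(not Condor code, and not how the starter is implemented). It assumes a POSIX
system; the helper names describe_exit and run_and_kill are made up for this
example. A parent that waits on a child can see that the child died from a
signal and which one, but nothing in the exit status says who sent it:

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a Popen return code into a short description.

    On POSIX, subprocess reports death-by-signal as a negative return code.
    """
    if returncode < 0:
        return "killed by signal %d" % -returncode
    return "exited with status %d" % returncode


def run_and_kill(sig):
    """Start a sleeping child, send it `sig`, and return its return code.

    The sender here is this script, but it could equally be a machine
    owner typing `kill`; the parent's view would be the same.
    """
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    child.send_signal(sig)
    return child.wait()


if __name__ == "__main__":
    # The parent learns the signal number, not the identity of the sender.
    print(describe_exit(run_and_kill(signal.SIGKILL)))
```

Whatever sent the signal, the parent gets the same information back, which
seems to match the behaviour described above.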
> Why are administrators killing Condor jobs? Note I don't say sending
> them SIGQUIT because that isn't what is happening, they are killing
> the jobs outside of Condor. Why aren't they using condor_vacate or
> condor_vacate_job for this purpose? There is no way for Condor to
> know why the job exited otherwise.
The problem is that the majority of our machine owners are also local
administrators for those machines, and the pool is too big and varied to
instruct everyone on condor_vacate and suspension policy settings. So
what happens sometimes is that a machine owner logs in and kills a
suspended condor_exec process to reclaim resources.
We could default to a want_suspend = false policy, eliminating the need
for local administrators to reclaim resources, but since most jobs do
not checkpoint, we'd prefer to have suspension where possible.
If we can't "catch" jobs that are being killed outside Condor, I suppose
the only way is to re-queue them after reviewing the logs with non-zero
Rob de Graaf