
Re: [Condor-users] administrator SIGQUIT vs condor_vacate SIGTERM



Daniel,

Thanks for the reply,

> Your analysis here is wrong.  The STARTER gets SIGTERM.  It then kills
> the job.
>
> Note that the SIGQUIT comes after the job has exited.  This is part of
> the normal termination of the STARTER by the STARTD after the job has
> finished.  The STARTER doesn't know why the job exited, only that it did.

I see. So regardless of the job's exit status, the starter only knows that the job has exited, and the startd then terminates the starter?
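
To illustrate the point in general terms, here is a minimal Python sketch (not Condor code) of what a parent process sees when its child is killed. The child here is a stand-in Python process that just sleeps, and the helper names `describe_exit` and `run_and_kill` are made up for this example. The parent learns the signal number, but nothing about who sent it or why.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Describe a child's exit the way its parent process sees it."""
    if returncode < 0:
        # On POSIX, subprocess reports death by signal N as returncode -N.
        return "killed by signal %d" % -returncode
    return "exited with status %d" % returncode


def run_and_kill(sig):
    """Start a sleeping child, send it `sig`, and report how it ended."""
    # Stand-in for a job: a child Python process that sleeps for 30 seconds.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    child.send_signal(sig)
    child.wait()
    return describe_exit(child.returncode)


if __name__ == "__main__":
    # The parent cannot distinguish an administrator's kill from any other
    # sender of the same signal; it only sees the signal number.
    print(run_and_kill(signal.SIGKILL))
```

Whether the signal came from a machine owner's shell or from anywhere else, the result looks the same from the parent's side.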

> Why are administrators killing Condor jobs?  Note I don't say sending
> them SIGQUIT because that isn't what is happening, they are killing
> the jobs outside of Condor.  Why aren't they using condor_vacate or
> condor_vacate_job for this purpose?  There is no way for Condor to
> know why the job exited otherwise.

The problem is that the majority of our machine owners are also local administrators for those machines, and the pool is too big and varied to instruct everyone on condor_vacate and suspension policy settings. So what happens sometimes is that a machine owner logs in and kills a suspended condor_exec process to reclaim resources.

We could default to a want_suspend = false policy, which would remove the need for local administrators to reclaim resources by hand, but since most jobs do not checkpoint, we'd prefer to keep suspension where possible.

If we can't "catch" jobs that are being killed outside Condor, I suppose the only way is to review the logs for jobs with non-zero return values and re-queue them?
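
Something like the following Python sketch is what I have in mind for the log review. It rests on an assumption about the user log layout that should be checked against real logs: that a termination is recorded as an event line starting with `005 (cluster.proc.subproc)`, followed by an indented line such as `(1) Normal termination (return value 1)` or `(0) Abnormal termination (signal 9)`. The function name `find_failed_jobs` is made up for this example.

```python
import re

# Assumed layout of an event header line, e.g. "005 (012.000.000) 03/01 ...".
EVENT_RE = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.\d+\)")
# Assumed layout of the detail lines that follow a termination event.
NORMAL_RE = re.compile(r"Normal termination \(return value (-?\d+)\)")
SIGNAL_RE = re.compile(r"Abnormal termination \(signal (\d+)\)")


def find_failed_jobs(log_text):
    """Return (job_id, reason) pairs for jobs that did not exit cleanly."""
    failed = []
    current = None  # (event number, "cluster.proc") of the event being read
    for line in log_text.splitlines():
        header = EVENT_RE.match(line)
        if header:
            job_id = "%d.%d" % (int(header.group(2)), int(header.group(3)))
            current = (header.group(1), job_id)
            continue
        if current is None or current[0] != "005":
            continue  # only termination events are of interest here
        killed = SIGNAL_RE.search(line)
        if killed:
            failed.append((current[1], "signal " + killed.group(1)))
            continue
        exited = NORMAL_RE.search(line)
        if exited and int(exited.group(1)) != 0:
            failed.append((current[1], "return value " + exited.group(1)))
    return failed


if __name__ == "__main__":
    import sys

    with open(sys.argv[1]) as log_file:
        for job_id, reason in find_failed_jobs(log_file.read()):
            print(job_id, reason)
```

The jobs it lists would only be candidates for re-queueing. A non-zero return value can also be a genuine job failure, so each one would still need a look before it is resubmitted.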

Thanks,

Rob de Graaf