
Re: [Condor-users] administrator SIGQUIT vs condor_vacate SIGTERM



Todd Tannenbaum wrote:
> Rob de Graaf wrote:
> >If we can't "catch" jobs that are being killed outside condor, I suppose 
> >the only way is to re-queue them after reviewing the logs with non-zero 
> >return values?
> 
> Course the worry there is what if your job actually exits with non-zero?
> 
> Another idea is to ask Condor to rerun the job if it is killed with a 
> sigterm or a sigquit signal.  Seems unlikely that a job would exit on 
> its own accord with either of those signals.
> 
> Off the top of my head, I think you could do the above by placing the 
> following in your condor submit file:
> 
>    on_exit_remove = (ExitBySignal == False) ||
>                     ((ExitSignal != 3) && (ExitSignal != 15))

I don't think this will help.  The jobs are exiting normally with an
exit code of 1, not by getting a signal.  Whatever the Windows process
is for killing a job, it doesn't look like a signal to Condor.


What we do here is something similar to this:

on_exit_hold = ((ExitCode =!= UNDEFINED) && (ExitCode != 0)) ||
	       (ExitBySignal == True)

(This could probably be reduced to (ExitCode =!= 0) since at exit time
 ExitCode should only be UNDEFINED when ExitBySignal is TRUE.)
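
As a sanity check on that reduction, here is a minimal Python sketch that
models the two expressions and compares them over the exit states the
reasoning above allows. It is only a model, not Condor's ClassAd evaluator:
None stands in for UNDEFINED, the function names are made up for
illustration, and it assumes ExitCode is UNDEFINED exactly when the job
exited by a signal (and defined otherwise).

```python
from typing import Iterator, Optional, Tuple


def hold_original(exit_code: Optional[int], exit_by_signal: bool) -> bool:
    # Models: ((ExitCode =!= UNDEFINED) && (ExitCode != 0)) || (ExitBySignal == True)
    return (exit_code is not None and exit_code != 0) or exit_by_signal


def hold_reduced(exit_code: Optional[int]) -> bool:
    # Models: ExitCode =!= 0
    # "Is not identical to 0" is true for any non-zero code and for UNDEFINED.
    return exit_code is None or exit_code != 0


def reachable_states() -> Iterator[Tuple[Optional[int], bool]]:
    # Assumption: a job that exited by signal has no ExitCode (None),
    # and a job that exited normally always has one.
    for code in (None, 0, 1, 2, 255):
        yield code, code is None


def forms_agree() -> bool:
    # True when both expressions make the same hold decision in every state.
    return all(hold_original(c, s) == hold_reduced(c) for c, s in reachable_states())
```

If a job could ever carry both a defined ExitCode of 0 and ExitBySignal ==
True, the two forms would disagree, so the longer expression is the safer
one when in doubt.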
 
Then we check periodically for jobs that have gone on hold and decide
based on log information whether to condor_release or condor_rm them.
There are issues with this for Standard Universe, but Vanilla Universe
should be fine.  This avoids the problem of ever having to recreate a
job which will then have a different ClusterId.ProcId.  Of course, it
helps if most of your jobs usually exit with a code of 0.
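
The periodic check could be scripted along these lines. This is a rough
sketch, not what we actually run: the function names and the should_retry
callback are hypothetical, and the decision logic (the part that reads the
logs) is left to the caller because it is site-specific. It assumes
condor_q, condor_release and condor_rm are on the PATH and relies on
JobStatus == 5 meaning "held".

```python
import subprocess
from typing import Callable, Iterable, List


def list_held_jobs() -> List[str]:
    # Ask condor_q for held jobs (JobStatus == 5), one "ClusterId.ProcId" per line.
    result = subprocess.run(
        ["condor_q", "-constraint", "JobStatus == 5",
         "-format", "%d.", "ClusterId",
         "-format", "%d\n", "ProcId"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.split()


def plan_commands(job_ids: Iterable[str],
                  should_retry: Callable[[str], bool]) -> List[List[str]]:
    # should_retry is a hypothetical, caller-supplied check (for example,
    # one that inspects the job's log) returning True to release the job
    # and False to remove it.
    return [
        ["condor_release" if should_retry(job_id) else "condor_rm", job_id]
        for job_id in job_ids
    ]


def run_plan(commands: Iterable[List[str]]) -> None:
    # Execute each planned command; raises if any command fails.
    for argv in commands:
        subprocess.run(argv, check=True)
```

Something like run_plan(plan_commands(list_held_jobs(), my_log_check)) from
cron would do the job, where my_log_check is whatever test you trust. A
released job keeps its original ClusterId.ProcId, which is the point of
holding instead of resubmitting.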

-- 
Daniel K. Forrest	Laboratory for Molecular and
forrest@xxxxxxxxxxxxx	Computational Genomics
(608) 262 - 9479	University of Wisconsin, Madison