Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

Re: [Condor-users] administrator SIGQUIT vs condor_vacate SIGTERM

Date: Wed, 19 Dec 2007 16:58:48 -0600
From: Daniel Forrest <forrest@xxxxxxxxxxxxx>
Subject: Re: [Condor-users] administrator SIGQUIT vs condor_vacate SIGTERM

Todd Tannenbaum wrote:
> Rob de Graaf wrote:
> >If we can't "catch" jobs that are being killed outside condor, I suppose 
> >the only way is to re-queue them after reviewing the logs with non-zero 
> >return values?
> 
> Course the worry there is what if your job actually exits with non-zero?
> 
> Another idea is to ask Condor to rerun the job if it is killed with a 
> sigterm or a sigquit signal.  Seems unlikely that a job would exit on 
> its own accord with either of those signals.
> 
> Off the top of my head, I think you could do the above by placing the 
> following in your condor submit file:
> 
>    on_exit_remove = (ExitBySignal == False) ||
>                     ((ExitSignal != 3) && (ExitSignal != 15))

I don't think this will help.  The jobs are exiting normally with an
exit code of 1, not by getting a signal.  Whatever mechanism Windows
uses to kill a job, it doesn't look like a signal to Condor.

What we do here is something similar to this:

OnExitHold = ((ExitCode =!= UNDEFINED) && (ExitCode != 0)) ||
	     (ExitBySignal == True)

(This could probably be reduced to (ExitCode =!= 0) since at exit time
 ExitCode should only be UNDEFINED when ExitBySignal is TRUE.)
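
To make the reduction above concrete, here is a small Python sketch that
models the relevant ClassAd operators and compares the two forms.  It is
only an illustration: None stands in for UNDEFINED, and the helper names
(meta_not_equal, and3, or3, original_policy, simplified_policy) are made
up for this example, not part of Condor.

```python
UNDEFINED = None  # stand-in for the ClassAd UNDEFINED value (an assumption of this sketch)

def meta_not_equal(a, b):
    """Model of ClassAd =!= : always True or False, never UNDEFINED."""
    if a is UNDEFINED or b is UNDEFINED:
        return a is not b
    return a != b

def not_equal(a, b):
    """Model of ClassAd != : UNDEFINED if either operand is UNDEFINED."""
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return a != b

def equal(a, b):
    """Model of ClassAd == : UNDEFINED if either operand is UNDEFINED."""
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return a == b

def and3(a, b):
    """Three-valued AND: FALSE wins over UNDEFINED."""
    if a is False or b is False:
        return False
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return True

def or3(a, b):
    """Three-valued OR: TRUE wins over UNDEFINED."""
    if a is True or b is True:
        return True
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return False

def original_policy(exit_code, exit_by_signal):
    # ((ExitCode =!= UNDEFINED) && (ExitCode != 0)) || (ExitBySignal == True)
    return or3(
        and3(meta_not_equal(exit_code, UNDEFINED), not_equal(exit_code, 0)),
        equal(exit_by_signal, True),
    )

def simplified_policy(exit_code):
    # (ExitCode =!= 0)
    return meta_not_equal(exit_code, 0)

# The three situations expected at exit time: clean exit, non-zero exit,
# and death by signal (ExitCode UNDEFINED, ExitBySignal True).
for exit_code, by_signal in [(0, False), (1, False), (UNDEFINED, True)]:
    assert original_policy(exit_code, by_signal) == simplified_policy(exit_code)
```

The two forms agree only under the stated assumption.  If a job ad ever
had ExitCode 0 together with ExitBySignal True, the long form would hold
the job and the short form would not.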

Then we check periodically for jobs that have gone on hold and decide
based on log information whether to condor_release or condor_rm them.
There are issues with this for Standard Universe, but Vanilla Universe
should be fine.  This avoids the problem of ever having to recreate a
job which will then have a different ClusterId.ProcId.  Of course, it
helps if most of your jobs usually exit with a code of 0.
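
The periodic check could be scripted roughly as follows.  This is a
sketch under assumptions, not our actual script: it assumes condor_q,
condor_release and condor_rm are on the PATH and that the script runs as
the job owner, and the should_retry callback is a hypothetical
placeholder for whatever log review your site does.

```python
import subprocess

HELD = 5  # JobStatus value for jobs in the Held state

def list_held_jobs():
    """Ask condor_q for held jobs, one ClusterId.ProcId per line."""
    result = subprocess.run(
        ["condor_q", "-constraint", f"JobStatus == {HELD}",
         "-format", "%d.", "ClusterId", "-format", "%d\n", "ProcId"],
        capture_output=True, text=True, check=True,
    )
    return parse_job_ids(result.stdout)

def parse_job_ids(text):
    """Split condor_q output into job ids, skipping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]

def triage(job_ids, should_retry):
    """Sort held jobs into those to release and those to remove.

    should_retry is a site-specific callback (hypothetical here) that
    inspects the logs for a job id and returns True if it should rerun.
    """
    plan = {"release": [], "remove": []}
    for job_id in job_ids:
        plan["release" if should_retry(job_id) else "remove"].append(job_id)
    return plan

def apply_plan(plan, run=subprocess.run):
    """Release or remove each job; run is injectable so it can be dry-run."""
    for job_id in plan["release"]:
        run(["condor_release", job_id], check=True)
    for job_id in plan["remove"]:
        run(["condor_rm", job_id], check=True)
```

A cron job could then call apply_plan(triage(list_held_jobs(), my_check))
where my_check is your own log-review function.  Released jobs keep
their original ClusterId.ProcId, which is the point of holding rather
than resubmitting.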

-- 
Daniel K. Forrest	Laboratory for Molecular and
forrest@xxxxxxxxxxxxx	Computational Genomics
(608) 262 - 9479	University of Wisconsin, Madison
