Re: [Condor-users] USER_JOB_WRAPPER and Unix signals
- Date: Mon, 16 Aug 2004 09:35:31 +0100 (BST)
- From: Bruce Beckles <mbb10@xxxxxxxxx>
- Subject: Re: [Condor-users] USER_JOB_WRAPPER and Unix signals
On Wed, 11 Aug 2004, Dan Bradley wrote:
Thanks, Dan, for your comments - I have a few more questions on Condor's
use of Unix/Linux signals which I hope you or someone else can help me
with:
- If a user uses the kill_sig command in their submit description
file, does Condor (a) check the value given to ensure it is a valid
signal, and (b) restrict that value in any way (for instance, it doesn't
make sense for it to be SIGSTOP (23))?
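
To make it concrete what I mean by (a) and (b), here is a minimal Python sketch of the kind of check I have in mind. It says nothing about what Condor actually does. The function name parse_kill_sig is made up, and accepting either a name or a number is my assumption. Signal numbers differ between platforms, so the lookup goes through the local signal table rather than hard-coded values.

```python
import signal

# Signals a process can neither catch nor ignore, so they make no sense
# as a "please shut down gracefully" request.
UNCATCHABLE = {signal.SIGKILL, signal.SIGSTOP}


def parse_kill_sig(value):
    """Hypothetical check: map a name or number to a usable signal.

    Raises ValueError if the value is not a signal known on this
    platform, or if it is one of the uncatchable signals.
    """
    text = str(value).strip()
    try:
        if text.isdigit():
            sig = signal.Signals(int(text))      # lookup by number
        else:
            name = text.upper()
            if not name.startswith("SIG"):
                name = "SIG" + name              # allow "term" as well as "SIGTERM"
            sig = signal.Signals[name]           # lookup by name
    except (KeyError, ValueError):
        raise ValueError("not a valid signal on this platform: %r" % (value,)) from None
    if sig in UNCATCHABLE:
        raise ValueError("%s cannot be caught by the job" % sig.name)
    return sig
```

With this, parse_kill_sig("SIGUSR1") returns the local SIGUSR1, while "SIGSTOP" or an unknown name is rejected with ValueError.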
- Scouring the manual I've discovered the following settings that affect
how long Condor will wait before escalating its attempt to stop the
  KILLING_TIMEOUT: length of time after starting to vacate a job
    before a SIGKILL is sent
  SHUTDOWN_FAST_TIMEOUT: length of time daemons are given to perform
    a fast shutdown before they are killed
  SHUTDOWN_GRACEFUL_TIMEOUT: length of time daemons are given to do
    a graceful shutdown before they do a hard shutdown
Are there any other settings affecting this area that I've missed?
What constitutes a "hard shutdown" in this context? Is it just sending
- The example init boot script included in the Condor distribution sends a
SIGQUIT to the condor_master to initiate shutdown of Condor. The
comments in this script say:
# send SIGQUIT to the condor_master, which initiates its fast
# shutdown method. The master itself will start sending
# SIGKILL to all it's children if they're not gone in 20
# seconds.
Is this interval of 20 seconds correct (the comments at the top of the
script are dated 1998, so it may have changed since then)? Is this
interval hard-coded, or can it be changed? If it can be changed, how?
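
For reference, the escalation pattern I understand the timeouts and the script comment to describe is roughly the following Python sketch. This is my reading of the pattern, not Condor's implementation. The function name stop_with_escalation is made up, and the 20-second default is simply the figure from the script comment.

```python
import signal
import subprocess


def stop_with_escalation(proc, soft_signal=signal.SIGTERM, timeout=20.0):
    """Ask a child to stop, then hard-kill it if it has not gone in time.

    proc is a subprocess.Popen object. Returns True if SIGKILL had to be
    sent, False if the child went away after the soft signal alone.
    """
    proc.send_signal(soft_signal)        # polite request first
    try:
        proc.wait(timeout=timeout)       # give the child time to clean up
        return False
    except subprocess.TimeoutExpired:
        proc.kill()                      # SIGKILL: cannot be caught or ignored
        proc.wait()                      # reap the child
        return True
```

What I would like to know is which configuration setting (if any) plays the role of the timeout argument in each of Condor's shutdown paths.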
- The SIGQUIT, SIGHUP and SIGTERM signals are all handled by the DaemonCore
library, and so presumably might be sent by a Condor process to a Condor
daemon. Are SIGHUP and SIGQUIT ever sent by Condor to any processes
which are _not_ Condor daemons?
- Condor detects if the job exits via a signal. Suppose my job (J) is
actually just a wrapper for some other program/shell script (P).
Suppose that after spawning P, J just waits for P to terminate and then
exits. If P exits via a signal, will Condor regard that as the job
exiting via a signal, or will it regard it as "normal termination" (as J
has exited "normally")?
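
If the answer is that only J's own exit status counts, then I assume the wrapper would have to pass P's fate on itself. A minimal Python sketch of that idea follows. The names run_wrapped and exit_like are made up, and this is the generic Unix technique of re-raising the signal on oneself, not anything taken from Condor.

```python
import os
import signal
import subprocess
import sys


def run_wrapped(argv):
    """Run P and wait for it. A negative result -N means P died from signal N."""
    return subprocess.call(argv)


def exit_like(returncode):
    """Terminate the current process (J) the same way P terminated."""
    if returncode < 0:
        signum = -returncode
        try:
            # Make sure the signal has its default (terminating) action.
            signal.signal(signum, signal.SIG_DFL)
        except (OSError, ValueError):
            pass  # e.g. SIGKILL, whose disposition cannot be changed
        os.kill(os.getpid(), signum)     # die from the same signal as P
        # Only reached if the signal did not terminate us after all.
        sys.exit(128 + signum)
    sys.exit(returncode)
```

A wrapper script would then end with exit_like(run_wrapped(sys.argv[1:])), so that whoever is waiting on J sees a death by signal whenever P died by signal.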
- In the Vanilla, Java, MPI, PVM and Scheduler universes, when Condor
vacates the job gracefully by sending it a SIGTERM (or whatever the
KillSig ClassAd attribute has been set to), does it send this signal
just to the immediate child of the condor_starter, or to all the
processes (if any) spawned by that child as well?
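
To illustrate the distinction I am asking about, here is a Python sketch of the two possibilities, assuming the parent put the job into its own process group when it started it. Whether the condor_starter does anything like this is exactly what I don't know. The function names are made up.

```python
import os
import signal
import subprocess


def start_in_own_group(argv, **popen_kwargs):
    """Start the job as leader of a new session (and so a new process group)."""
    return subprocess.Popen(argv, start_new_session=True, **popen_kwargs)


def signal_child_only(proc, signum=signal.SIGTERM):
    """Possibility 1: only the immediate child receives the signal."""
    proc.send_signal(signum)


def signal_whole_group(proc, signum=signal.SIGTERM):
    """Possibility 2: every process still in the child's group receives it."""
    os.killpg(os.getpgid(proc.pid), signum)
```

Note that even in the second case a descendant that moves itself into a different process group would not receive the signal.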
> Bruce Beckles wrote:
> >...and so I need to know what signals Condor will send to the user job -
> >trawling the manual seems to reveal the following:
> >- SIGUSR2:
> > cause a job in the Standard universe to checkpoint and then continue
> > executing.
> >- SIGTSTP (or the value of the KillSig ClassAd attribute):
> > cause a job in the Standard universe to try and gracefully shutdown
> > (i.e. checkpoint).
> >- SIGTERM (or the value of the KillSig ClassAd attribute):
> > cause a job in the Vanilla universe to try and gracefully shutdown,
> > i.e. normal Unix termination (noting that the program may catch
> > SIGTERM and try to clean up). Is this also true for jobs in the other
> > non-Standard (Java, MPI, PVM and Scheduler) universes?
> >- SIGKILL:
> > kill (i.e. send the hard-kill signal to) the job, if the job takes too
> > long to gracefully shutdown or doesn't respond to the appropriate
> > signal.
Apart from SIGSTOP/SIGCONT for suspending/continuing a job, are there any
other signals I missed? (Obviously the user can set the KillSig ClassAd
attribute to a signal I've not listed above...)
Any answers/information gratefully received!
University of Cambridge Computing Service.