
Re: [Condor-users] USER_JOB_WRAPPER and Unix signals



On Wed, 11 Aug 2004, Dan Bradley wrote:
<snip>

Thanks, Dan, for your comments.  I have a few more questions on Condor's
use of Unix/Linux signals, which I hope you or someone else can help me
with:

- If a user uses the kill_sig command in their submit description
  file, does Condor (a) check the value given to ensure it is a valid
  signal, and (b) restrict that value in any way (for instance, it doesn't
  make sense for it to be SIGSTOP (signal 23 on some platforms))?


- Scouring the manual I've discovered the following settings that affect
  how long Condor will wait before escalating its attempt to stop the
  job/its daemons:
      KILLING_TIMEOUT: length of time after starting to vacate job
                       before a SIGKILL is sent
      SHUTDOWN_FAST_TIMEOUT: length of time daemons are given to perform
                             a fast shutdown before they are killed
                             outright
      SHUTDOWN_GRACEFUL_TIMEOUT: length of time daemons are given to do
                                 a graceful shutdown before they do a
                                 hard shutdown
  Are there any other settings affecting this area that I've missed?
  What constitutes a "hard shutdown" in this context?  Is it just sending
  SIGKILL?


- The example init boot script included in the Condor distribution sends a
  SIGQUIT to the condor_master to initiate shutdown of Condor.  The
  comments in this script say:
        # send SIGQUIT to the condor_master, which initiates its fast
        # shutdown method.  The master itself will start sending
        # SIGKILL to all it's children if they're not gone in 20
        # seconds.

  Is this interval of 20 seconds correct (the comments at the top of the
  script are dated 1998, so it may have changed since then)?  Is this
  interval hard-coded, or can it be changed?  If it can be changed, how?


- SIGQUIT, SIGHUP and SIGTERM are all handled by the DaemonCore
  library, and so presumably might be sent by a Condor process to a Condor
  daemon.  Are SIGHUP and SIGQUIT ever sent by Condor to any processes
  which are _not_ Condor daemons?


- Condor detects if the job exits via a signal.  Suppose my job (J) is
  actually just a wrapper for some other program/shell script (P).
  Suppose that after spawning P, J just waits for P to terminate and then
  exits.  If P exits via a signal, will Condor regard that as the job
  exiting via a signal, or will it regard it as "normal termination" (as J
  has exited "normally")?


- In the Vanilla, Java, MPI, PVM and Scheduler universes, when Condor
  vacates the job gracefully by sending it a SIGTERM (or whatever the
  KillSig ClassAd attribute has been set to), does it send this signal
  just to the immediate child of the condor_starter, or to all the
  processes (if any) spawned by that child as well?
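
To make the wrapper question (J waiting on P) concrete, here is a minimal
Python sketch of the scenario.  It is not Condor code and says nothing about
what Condor actually does; the function names run_and_propagate and
finish_like_child are hypothetical, chosen only for illustration.  It shows
how a wrapper can tell that its child died from a signal, and how it could
choose to terminate the same way so that its own parent sees a signal death
rather than a normal exit.

```python
import os
import signal
import subprocess
import sys


def run_and_propagate(argv):
    """Run P (given as argv), wait for it, and report how it ended.

    Returns ("exit", code) for a normal exit, or ("signal", signum)
    if P was terminated by a signal.
    """
    proc = subprocess.Popen(argv)
    returncode = proc.wait()
    if returncode < 0:
        # On POSIX, subprocess reports death-by-signal as -signum.
        return ("signal", -returncode)
    return ("exit", returncode)


def finish_like_child(kind, value):
    """Make the wrapper J end the same way P did (hypothetical helper)."""
    if kind == "signal":
        try:
            # Restore the default action so the signal really terminates J.
            signal.signal(value, signal.SIG_DFL)
        except (OSError, ValueError):
            pass  # e.g. SIGKILL, whose disposition cannot be changed
        os.kill(os.getpid(), value)
        # Only reached if the signal did not terminate this process.
        sys.exit(128 + value)
    sys.exit(value)
```

A wrapper that simply calls sys.exit() with some status after P has been
killed would look like a normal termination to its parent; one that ends
with finish_like_child(*run_and_propagate(["/path/to/P"])) (the path is a
placeholder) would itself appear to have exited via the signal.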


> Bruce Beckles wrote:
<snip>
> >...and so I need to know what signals Condor will send to the user job -
> >trawling the manual seems to reveal the following:
> >
> >- SIGUSR2:
> >    cause a job in the Standard universe to checkpoint and then continue
> >    executing.
> >
> >- SIGTSTP (or the value of the KillSig ClassAd attribute):
> >    cause a job in the Standard universe to try and gracefully shutdown
> >    (i.e. checkpoint).
> >
> >- SIGTERM (or the value of the KillSig ClassAd attribute):
> >    cause a job in the Vanilla universe to try and gracefully shutdown,
> >    i.e. normal Unix termination (noting that the program may catch
> >    SIGTERM and try to clean up).  Is this also true for jobs in the other
> >    non-Standard (Java, MPI, PVM and Scheduler) universes?
> >
> >- SIGKILL:
> >    kill (i.e. send the hard-kill signal to) the job, if the job takes too
> >    long to gracefully shutdown or doesn't respond to the appropriate
> >    signal.
<snip>

Apart from SIGSTOP/SIGCONT for suspending/continuing a job, are there any
other signals I missed?  (Obviously the user can set the KillSig ClassAd
attribute to a signal I've not listed above...)
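
For reference, the "graceful signal first, SIGKILL if that takes too long"
pattern described in the quoted list can be sketched in Python as below.
This is only an illustration under my own assumptions, not Condor's
implementation: stop_with_escalation is a hypothetical name, and its timeout
argument merely plays a role analogous to a setting such as KILLING_TIMEOUT.

```python
import signal
import subprocess


def stop_with_escalation(proc, soft_sig=signal.SIGTERM, timeout=5.0):
    """Ask proc to stop with soft_sig; escalate to SIGKILL after timeout.

    proc is a subprocess.Popen object.  Returns "graceful" if the process
    went away within the timeout, or "hard" if SIGKILL was needed.
    """
    proc.send_signal(soft_sig)
    try:
        proc.wait(timeout=timeout)
        return "graceful"
    except subprocess.TimeoutExpired:
        # The process ignored or outlived the soft signal: hard-kill it.
        proc.kill()  # sends SIGKILL on POSIX
        proc.wait()
        return "hard"
```

Note that this signals only the one process it was given; whether its
descendants are signalled too is exactly the kind of detail asked about
above.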


Any answers/information gratefully received!

Thanks,

  Bruce

--
Bruce Beckles,
e-Science Specialist,
University of Cambridge Computing Service.