Re: [Condor-users] (no subject)
- Date: Thu, 17 Jul 2008 00:52:48 +0100 (BST)
- From: Bruce Beckles <mbb10@xxxxxxxxx>
- Subject: Re: [Condor-users] (no subject)
On Mon, 14 Jul 2008, Tanzima Zerin Islam wrote:
<snip>
"- If a user uses the kill_sig command in their submit description
file, does Condor (a) check the value given to ensure it is a valid
signal, and (b) restrict that value in any way (for instance, it doesn't
make sense for it to be SIGSTOP (23))?
In the Condor 6.6 series, the answer to both these questions seemed to be
"no". This may have changed since then.
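For illustration only, here is a minimal Python sketch of the sort of check that questions (a) and (b) ask about. It is not how Condor treats kill_sig. The function name check_kill_sig is hypothetical, and the sketch assumes the value arrives either as a number or as a full name such as SIGTERM.

```python
import signal

# Signals a job can never catch, so they make no sense as a "soft" kill signal.
UNCATCHABLE = {signal.SIGKILL, signal.SIGSTOP}


def check_kill_sig(value):
    """Return the signal number for value, or raise ValueError.

    Hypothetical helper: accepts a number ("15") or a full name ("SIGTERM").
    Only signals that Python's signal module knows by name are accepted.
    """
    text = str(value).strip()
    try:
        if text.isdigit():
            sig = signal.Signals(int(text))      # (a) is it a real signal number?
        else:
            sig = signal.Signals[text.upper()]   # (a) is it a real signal name?
    except (KeyError, ValueError):
        raise ValueError(f"{text!r} is not a valid signal on this platform") from None
    if sig in UNCATCHABLE:                       # (b) restrict unsuitable values
        raise ValueError(f"{sig.name} cannot be caught by the job")
    return int(sig)
```

Signal numbers differ between platforms, so a check like this would have to run on the machine where the job executes rather than on the submit machine.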
- Scouring the manual I've discovered the following settings that affect
how long Condor will wait before escalating its attempt to stop the
job/its daemons:
KILLING_TIMEOUT: length of time after starting to vacate job
before a SIGKILL is sent
SHUTDOWN_FAST_TIMEOUT: length of time daemons are given to perform
a fast shutdown before they are killed
outright
SHUTDOWN_GRACEFUL_TIMEOUT: length of time daemons are given to do
a graceful shutdown before they do a
hard shutdown
Are there any other settings affecting this area that I've missed?
I suspect that newer versions of Condor (i.e. post Condor 6.6.11) have
some other settings as well, but I haven't trawled the manual and source
thoroughly enough to be sure.
What constitutes a "hard shutdown" in this context? Is it just sending
SIGKILL?
In the Vanilla universe in the Condor 6.6 series, a "hard shutdown"
certainly meant that a SIGKILL was sent. However, I never discovered
whether Condor did anything else as well as sending the SIGKILL. And, of
course, behaviour may have changed in subsequent releases of Condor.
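The general pattern behind these timeouts (a soft signal, a grace period, then SIGKILL) can be sketched in Python as below. This is an illustration of the pattern only, not Condor's code. The function name stop_with_escalation and its timeout parameter are hypothetical stand-ins for whatever a setting like KILLING_TIMEOUT controls.

```python
import signal
import subprocess


def stop_with_escalation(proc, soft_signal=signal.SIGTERM, timeout=5.0):
    """Ask proc to stop, then force it if it has not gone within timeout seconds.

    proc is a subprocess.Popen object. Returns its exit status; a negative
    value -N means the process was terminated by signal N.
    """
    proc.send_signal(soft_signal)          # the polite request
    try:
        return proc.wait(timeout=timeout)  # the grace period
    except subprocess.TimeoutExpired:
        proc.kill()                        # SIGKILL on POSIX: cannot be caught or ignored
        return proc.wait()
```

A process that ignores the soft signal therefore ends with status -9 (SIGKILL), while one that exits within the grace period reports its own status.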
- The example init boot script included in the Condor distribution sends a
SIGQUIT to the condor_master to initiate shutdown of Condor. The
comments in this script say:
# send SIGQUIT to the condor_master, which initiates its fast
# shutdown method. The master itself will start sending
# SIGKILL to all it's children if they're not gone in 20
# seconds.
Is this interval of 20 seconds correct (the comments at the top of the
script are dated 1998, so it may have changed since then)? Is this
interval hard-coded, or can it be changed? If it can be changed, how?
I never got an answer to this, and it was never important enough for me to
figure it out from the source code, I'm afraid. In our environment it
proved not to be important.
- The SIGQUIT, SIGHUP and SIGTERM are all handled by the DaemonCore
library, and so presumably might be sent by a Condor process to a Condor
daemon. Are SIGHUP and SIGQUIT ever sent by Condor to any processes
which are _not_ Condor daemons?
Again, I never got an answer to this, and it was never important enough
for me to figure it out from the source code, I'm afraid. In practice in
our environment it proved not to be so important.
- Condor detects if the job exits via a signal. Suppose my job (J) is
actually just a wrapper for some other program/shell script (P).
Suppose that after spawning P, J just waits for P to terminate and then
exits. If P exits via a signal, will Condor regard that as the job
exiting via a signal, or will it regard it as "normal termination" (as J
has exited "normally")?
In the Condor 6.6 series, Condor would regard this as "normal termination"
because it only pays attention to how J terminates, not how any processes
spawned by J might terminate. Again, this may have changed in more recent
versions of Condor, although I would be surprised.
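If only J's own termination is observed, one workaround is for J to end the same way P did. The following Python sketch shows that idea for a POSIX system. The names run_child, classify and exit_like are hypothetical, and nothing here is taken from Condor itself.

```python
import os
import signal
import subprocess
import sys


def run_child(argv):
    """Run P and return its status as subprocess reports it."""
    return subprocess.call(argv)


def classify(returncode):
    """Return ("signal", n) if the child died from signal n, else ("exit", code)."""
    if returncode < 0:
        # subprocess reports death by signal N as the return code -N.
        return ("signal", -returncode)
    return ("exit", returncode)


def exit_like(returncode):
    """End this process (J) the way the child (P) ended."""
    kind, value = classify(returncode)
    if kind == "signal":
        try:
            signal.signal(value, signal.SIG_DFL)  # make sure the default action applies
        except (OSError, ValueError):
            pass  # SIGKILL and SIGSTOP cannot have their handler changed
        os.kill(os.getpid(), value)               # die from the same signal as P
        # Only reached if the signal did not terminate this process.
        sys.exit(128 + value)
    sys.exit(returncode)
```

A wrapper script built on this would finish with exit_like(run_child(sys.argv[1:])), so that whatever starts J sees J exit via the same signal that ended P.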
- In the Vanilla, Java, MPI, PVM and Scheduler universes, when Condor
vacates the job gracefully by sending it a SIGTERM (or whatever the
KillSig ClassAd attribute has been set to), does it send this signal
just to the immediate child of the condor_starter, or to all the
processes (if any) spawned by that child as well? "
In the Vanilla universe of the Condor 6.6 series, Condor seemed to send
the signal specified in the KillSig ClassAd attribute just to the
immediate child of the condor_starter. (I never investigated for the
other universes since we don't use them here.) Again, this behaviour may
have changed in more recent releases of Condor.
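If the signal reaches only the immediate child, a wrapper job can relay it to everything it has spawned. The Python sketch below does this on a POSIX system by putting the real program in its own process group. All function names are hypothetical, and it assumes the spawned processes do not move themselves into a different process group.

```python
import os
import signal
import subprocess


def spawn_in_own_group(argv):
    """Start argv as the leader of a new session and process group."""
    return subprocess.Popen(argv, start_new_session=True)


def signal_group(child, signum):
    """Send signum to the child's whole process group."""
    try:
        os.killpg(child.pid, signum)  # the group ID equals the leader's PID
    except ProcessLookupError:
        pass  # the group has already gone


def run_with_forwarding(argv, signums=(signal.SIGTERM,)):
    """Run argv, relaying the listed signals to it and its descendants."""
    child = None

    def relay(signum, _frame):
        if child is not None:
            signal_group(child, signum)

    # Install the handlers first so a signal arriving early is not lost
    # to the default action. Must be called from the main thread.
    previous = {s: signal.signal(s, relay) for s in signums}
    try:
        child = spawn_in_own_group(argv)
        return child.wait()
    finally:
        for s, handler in previous.items():
            signal.signal(s, handler if handler is not None else signal.SIG_DFL)
```

The wrapper itself keeps running after relaying the signal and returns the child's status, so it can be combined with a step that exits the same way the child did.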
As far as I can remember, I never got an "official" answer to the above
questions, so my answers above are based on what I discovered by
experimentation, exhaustive reading of the manual, and trawling the source
code. So, whilst I believe the answers above are correct, at least for
later versions of the Condor 6.6 series, I may be wrong, or there may be
something unusual about our environment which means things behave
differently for you. (I.e. your mileage may vary!)
Hope that helps!
-- Bruce
--
Bruce Beckles,
e-Science Specialist,
University of Cambridge Computing Service.