
Re: [Condor-users] (no subject)



On Mon, 14 Jul 2008, Tanzima Zerin Islam wrote:

<snip>
"- If a user uses the kill_sig command in their submit description
 file, does Condor (a) check the value given to ensure it is a valid
 signal, and (b) restrict that value in any way (for instance, it doesn't
 make sense for it to be SIGSTOP (23))?

In the Condor 6.6 series, the answer to both these questions seemed to be "no". This may have changed since then.
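
Since Condor may not perform these checks itself, a user could sanity-check the value before putting it in a submit description file. The following is only a sketch of such a check, not anything Condor does: the helper name check_kill_sig is hypothetical, it assumes a POSIX platform, and accepting both names and numbers is an assumption of the sketch rather than a statement about what kill_sig accepts.

```python
import signal

# Signals that a process can neither catch nor ignore, so they cannot
# act as a "please shut down cleanly" request. (POSIX only.)
UNCATCHABLE = {signal.SIGKILL, signal.SIGSTOP}


def check_kill_sig(value):
    """Return the signal.Signals member for value, or raise ValueError.

    value may be a name ("SIGTERM" or "TERM") or a number (15).
    Hypothetical helper for illustration; not part of Condor.
    """
    text = str(value).strip().upper()
    try:
        if text.isdigit():
            sig = signal.Signals(int(text))
        else:
            name = text if text.startswith("SIG") else "SIG" + text
            sig = signal.Signals[name]
    except (KeyError, ValueError):
        raise ValueError(f"{value!r} is not a valid signal on this platform")
    if sig in UNCATCHABLE:
        raise ValueError(f"{sig.name} cannot be caught, so it is useless as kill_sig")
    return sig
```

Signal numbers differ between platforms, so a check like this is only meaningful when run on the same kind of machine that will execute the job.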


- Scouring the manual I've discovered the following settings that affect
 how long Condor will wait before escalating its attempt to stop the
 job/its daemons:
	KILLING_TIMEOUT: length of time after starting to vacate job
	                  before a SIGKILL is sent
	SHUTDOWN_FAST_TIMEOUT: length of time daemons are given to perform
	                        a fast shutdown before they are killed
                               outright
	SHUTDOWN_GRACEFUL_TIMEOUT: length of time daemons are given to do
	                            a graceful shutdown before they do a
                                   hard shutdown
 Are there any other settings affecting this area that I've missed?

I suspect that newer versions of Condor (i.e. post Condor 6.6.11) have some other settings, but I haven't trawled the manual and source thoroughly enough to be sure.

 What constitutes a "hard shutdown" in this context?  Is it just sending
 SIGKILL?

In the Vanilla universe in the Condor 6.6 series, a "hard shutdown" certainly meant that a SIGKILL was sent. However, I never discovered whether Condor did anything else as well as sending the SIGKILL. And, of course, behaviour may have changed in subsequent releases of Condor.
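
The general pattern under discussion (a polite signal, a grace period, then SIGKILL) can be sketched as below. This is an illustration of the pattern only, not Condor's implementation; stop_with_escalation is a hypothetical name, the timeout parameter merely plays the role that a setting such as KILLING_TIMEOUT plays above, and the 30 second default is arbitrary.

```python
import signal
import subprocess


def stop_with_escalation(proc, soft_sig=signal.SIGTERM, timeout=30.0):
    """Ask proc (a subprocess.Popen) to stop, then force it if it does not.

    Returns the child's return code; on POSIX a negative value -N means
    the child was terminated by signal N.
    """
    proc.send_signal(soft_sig)           # the "graceful" request
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()                      # SIGKILL on POSIX: cannot be caught
        return proc.wait()
```

A job that handles the soft signal and exits within the grace period never sees the SIGKILL; one that ignores it is killed once the timeout expires.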


- The example init boot script included in the Condor distribution sends a
 SIGQUIT to the condor_master to initiate shutdown of Condor.  The
 comments in this script say:
       # send SIGQUIT to the condor_master, which initiates its fast
       # shutdown method.  The master itself will start sending
       # SIGKILL to all its children if they're not gone in 20
       # seconds.

 Is this interval of 20 seconds correct (the comments at the top of the
 script are dated 1998, so it may have changed since then)?  Is this
 interval hard-coded, or can it be changed?  If it can be changed, how?

I never got an answer to this, and it was never important enough for me to figure it out from the source code, I'm afraid. In our environment it proved not to be important.


- The SIGQUIT, SIGHUP and SIGTERM are all handled by the DaemonCore
 library, and so presumably might be sent by a Condor process to a Condor
 daemon.  Are SIGHUP and SIGQUIT ever sent by Condor to any processes
 which are _not_ Condor daemons?

Again, I never got an answer to this, and it was never important enough for me to figure it out from the source code, I'm afraid. In practice in our environment it proved not to be so important.


- Condor detects if the job exits via a signal.  Suppose my job (J) is
 actually just a wrapper for some other program/shell script (P).
 Suppose that after spawning P, J just waits for P to terminate and then
 exits.  IF P exits via a signal, will Condor regard that as the job
 exiting via a signal, or will it regard it as "normal termination" (as J
 has exited "normally")?

In the Condor 6.6 series, Condor would regard this as "normal termination" because it only pays attention to how J terminates, not how any processes spawned by J might terminate. Again, this may have changed in more recent versions of Condor, although I would be surprised.
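
If only J's own termination is observed, one workaround is for J to end the same way P ended. The sketch below shows one way a wrapper could do that on a POSIX system; run_and_mirror_exit is a hypothetical name, and whether a given Condor version then reports the job as having exited via a signal is something to verify by experiment.

```python
import os
import signal
import subprocess
import sys


def run_and_mirror_exit(cmd):
    """Run cmd (P), wait for it, then end this process (J) the same way."""
    rc = subprocess.call(cmd)
    if rc < 0:
        # On POSIX, subprocess reports death-by-signal as the negated
        # signal number.
        sig = -rc
        try:
            # Make sure we do not catch or ignore the signal ourselves.
            signal.signal(sig, signal.SIG_DFL)
        except (OSError, ValueError):
            pass  # e.g. SIGKILL, whose disposition cannot be changed
        os.kill(os.getpid(), sig)
        # Only reached if the signal somehow did not terminate us.
        sys.exit(128 + sig)
    sys.exit(rc)
```

A wrapper script would simply call run_and_mirror_exit(sys.argv[1:]) as its last action, so that P's exit code or terminating signal becomes J's.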


- In the Vanilla, Java, MPI, PVM and Scheduler universes, when Condor
 vacates the job gracefully by sending it a SIGTERM (or whatever the
 KillSig ClassAd attribute has been set to), does it send this signal
 just to the immediate child of the condor_starter, or to all the
 processes (if any) spawned by that child as well?"

In the Vanilla universe of the Condor 6.6 series, Condor seemed to send the signal specified in the KillSig ClassAd attribute just to the immediate child of the condor_starter. (I never investigated for the other universes since we don't use them here.) Again, this behaviour may have changed in more recent releases of Condor.
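
If the signal reaches only the immediate child, a wrapper job that wants its descendants to see it has to pass it on itself. Below is a minimal sketch of one way to do that on POSIX, by running the real program in its own process group and relaying the signal to the whole group. run_forwarding is a hypothetical name, and the sketch assumes the descendants stay in that process group.

```python
import os
import signal
import subprocess


def run_forwarding(cmd, forward=(signal.SIGTERM,)):
    """Run cmd in its own process group and relay signals to that group.

    Must be called from the main thread (a Python restriction on
    installing signal handlers). Returns the child's return code.
    """
    child = None

    def relay(signum, frame):
        if child is None:
            return  # signal arrived before the child was started
        try:
            os.killpg(child.pid, signum)  # child leads its own group
        except ProcessLookupError:
            pass  # the group has already gone

    # Install handlers first so a signal cannot slip in unhandled.
    previous = {sig: signal.signal(sig, relay) for sig in forward}
    try:
        # start_new_session=True makes the child a session and process
        # group leader, so its group id equals its pid.
        child = subprocess.Popen(cmd, start_new_session=True)
        return child.wait()
    finally:
        for sig, handler in previous.items():
            signal.signal(sig, handler)
```

Relaying to a process group rather than to a single pid is the design choice that matters here: it reaches grandchildren without the wrapper having to track them individually.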


As far as I can remember, I never got an "official" answer to the above questions, so my answers above are based on what I discovered by experimentation, exhaustive reading of the manual, and trawling the source code. So, whilst I believe the answers above are correct, at least for later versions of the Condor 6.6 series, I may be wrong, or there may be something unusual about our environment which means things behave differently for you. (I.e. your mileage may vary!)

Hope that helps!

	-- Bruce

--
Bruce Beckles,
e-Science Specialist,
University of Cambridge Computing Service.