
[Condor-users] USER_JOB_WRAPPER and Unix signals



Hi -

	Section 3.3.12 of the Condor 6.6 manual, which documents the
USER_JOB_WRAPPER setting, says:

"This macro allows the administrator to specify a ''wrapper'' script to
 handle the execution of all user jobs. ... This wrapper program must
 ultimately replace its image with the user job; in other words, it must
 exec() the user job, not fork() it."
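
To make the requirement concrete, here is a minimal sketch of a wrapper
that exec()s rather than fork()s. It assumes the wrapper is handed the
job's command line as its own arguments; the function name exec_job and
the placeholder for site-specific setup are purely illustrative and do
not come from the manual.

```python
import os
import sys


def exec_job(argv):
    """Replace the current process image with the job (does not return)."""
    if not argv:
        raise ValueError("usage: wrapper JOB [ARGS...]")
    # Flush Python-level buffers; they would be lost when the image is replaced.
    sys.stdout.flush()
    sys.stderr.flush()
    # Site-specific setup (hypothetical) would go here, before the exec.
    # execvp keeps the PID, the environment and the open descriptors
    # (stdin/stdout/stderr), so the job runs as the process that was started.
    os.execvp(argv[0], argv)


# Assumption: the job's command line arrives as the wrapper's arguments.
if __name__ == "__main__" and len(sys.argv) > 1:
    exec_job(sys.argv[1:])
```

Because the process image is replaced in place, the job ends up with
the same PID the wrapper was started with, which is what question 2
below is getting at.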

I have two questions about this:

1. When it says "all user jobs" does it REALLY mean all user jobs
   regardless of the job's universe (so including Standard, Java, MPI, PVM
   and Scheduler universe jobs)?

2. Are the reasons for exec()-ing the user job rather than fork()-ing it
   the following?

	- To ensure that the user job inherits the environment Condor has
	  prepared for it, including environment variables and
	  redirection of standard error and standard output?  Is there
	  anything else that needs to be preserved?

	- So that Condor 'knows' which process (PID) to send the Unix
	  control signals to cause the job to suspend, checkpoint or
	  vacate as necessary?

   ... or are the reasons something else entirely?  Or are there other
   reasons in addition to the ones I've suggested above?


This leads me to what I really want to know, which is: if my "wrapper"
program

(a) ensures it passes the environment variables to, and preserves
    Condor's redirection of standard error and standard output for, its
    children, AND

(b) traps signals from the Condor starter and passes them on to its
    children, THEN

...can I fork() the user job instead of exec()-ing it, or will it all go
horribly wrong!?!
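
For reference, the fork()-style wrapper I have in mind looks roughly
like the sketch below. It is only an illustration under my own
assumptions: the job's command line arrives as the wrapper's arguments,
and the names run_and_forward and FORWARDED are made up for this
example. Note that SIGKILL and SIGSTOP can never be caught by any
process, so a wrapper like this has no way of relaying those two.

```python
import os
import signal
import subprocess
import sys

# Catchable signals named in this post. SIGKILL and SIGSTOP are left out
# because no process can catch them, so they cannot be relayed.
FORWARDED = (signal.SIGTERM, signal.SIGTSTP, signal.SIGUSR2, signal.SIGCONT)


def run_and_forward(argv):
    """Run argv as a child, relay FORWARDED signals to it, return its status."""
    child = None

    def forward(signum, _frame):
        # Relay the signal only while there is a child to receive it.
        if child is not None and child.returncode is None:
            try:
                child.send_signal(signum)
            except ProcessLookupError:
                pass  # the child exited in the meantime

    previous = {s: signal.signal(s, forward) for s in FORWARDED}
    try:
        # The child inherits the environment and stdin/stdout/stderr.
        child = subprocess.Popen(argv)
        return child.wait()
    finally:
        # Restore whatever handlers were installed before the call.
        for s, handler in previous.items():
            signal.signal(s, handler if handler is not None else signal.SIG_DFL)


# Assumption: the job's command line arrives as the wrapper's arguments.
if __name__ == "__main__" and len(sys.argv) > 1:
    status = run_and_forward(sys.argv[1:])
    if status < 0:
        # The job died from a signal: try to die the same way.
        os.kill(os.getpid(), -status)
        status = 128 - status  # fallback exit code if that did not end us
    sys.exit(status)
```

Even with the relaying in place, the job runs under a different PID
from the one that was started, so anything that inspects or signals
that PID directly sees the wrapper and not the job.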


...and so I need to know what signals Condor will send to the user job -
trawling the manual seems to reveal the following:

- SIGUSR2:
    cause a job in the Standard universe to checkpoint and then continue
    executing.

- SIGTSTP (or the value of the KillSig ClassAd attribute):
    cause a job in the Standard universe to try to shut down gracefully
    (i.e. checkpoint).

- SIGTERM (or the value of the KillSig ClassAd attribute):
    cause a job in the Vanilla universe to try to shut down gracefully,
    i.e. normal Unix termination (noting that the program may catch
    SIGTERM and try to clean up).  Is this also true for jobs in the other
    non-Standard (Java, MPI, PVM and Scheduler) universes?

- SIGKILL:
    kill (i.e. send the hard-kill signal to) the job, if the job takes too
    long to shut down gracefully or doesn't respond to the appropriate
    signal.

...but what about when it suspends a user job?  Does it send it a SIGSTOP?
Does it do anything else (as well as/instead of)?
...and similarly when it unsuspends a user job does it send a SIGCONT?
Does it do anything else (as well as/instead of)?
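
The reason the exact signals matter is that not all of them can be
trapped by a wrapper. The short probe below is a sketch (the helper
names can_trap and trappable_signals are invented for illustration)
that asks the operating system which of the signals mentioned above a
process is allowed to install a handler for. It says nothing about
which signals Condor actually sends.

```python
import signal

# Signals mentioned in this post.
LISTED = ("SIGUSR2", "SIGTSTP", "SIGTERM", "SIGKILL", "SIGSTOP", "SIGCONT")


def can_trap(signum):
    """Return True if this process is allowed to install a handler."""
    try:
        previous = signal.signal(signum, lambda _signum, _frame: None)
    except OSError:
        return False  # the kernel refuses handlers for this signal
    # Put back whatever was installed before the probe.
    signal.signal(signum, previous if previous is not None else signal.SIG_DFL)
    return True


def trappable_signals():
    """Map each listed signal name to whether it can be trapped."""
    return {name: can_trap(getattr(signal, name)) for name in LISTED}


if __name__ == "__main__":
    for name, ok in trappable_signals().items():
        print(f"{name}: {'can be trapped' if ok else 'cannot be trapped'}")
```

On a POSIX system SIGKILL and SIGSTOP are the two that cannot be
trapped, so if suspension is done with SIGSTOP, a fork()-ing wrapper
would be stopped while its child carried on running.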

Any help much appreciated!

	Thanks,

	  Bruce

--
Bruce Beckles,
e-Science Specialist,
University of Cambridge Computing Service.