RE: [Condor-users] condor_rm not killing subprocesses




> I'm a little confused by your note of no operating system support.  I
> have indicated a reliable way of finding these processes, at least on
> Linux. I now only seek some way of having Condor use this method, even
> if it means wrapping the condor executables.

If you are using the dynamically linked Condor executables you could always write your own replacement signalling functions, put them in a library and use LD_PRELOAD to have your library handle Condor's kill signals. That's a non-trivial amount of work, of course, but it would do it.


Alternatively, you could (using the USER_JOB_WRAPPER feature) arrange for a process to start up that ptraces the main process of the user's job and, when that process dies, sends a kill signal to all its children. That is also a bit nasty to implement, and there are various edge cases you need to consider.

A simpler solution is to use either the system cron facility or Condor's STARTD_CRON facility to run a job once a minute that checks whether a Condor job is supposed to be running: if so, it exits; if not, it looks for any stray child processes and kills them. That's what we do here. Of course, you run into problems if Condor starts another job within a minute of the previous job finishing, since the check may then never see an idle machine and the strays survive.
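
To make the cron approach a bit more concrete, here is a minimal sketch in Python. It is an illustration only, not the script used at any particular site. It assumes Linux with a kernel that provides /proc/PID/comm, that all jobs run under one dedicated UID given on the command line, and that the presence of a condor_starter process means a job is supposed to be running. The function names (list_processes, find_strays) are made up for this example.

```python
import os
import signal
import sys

# Assumption: a running condor_starter means Condor still has a job active.
STARTER_NAME = "condor_starter"


def list_processes(proc_root="/proc"):
    """Return a list of (pid, uid, command_name) for every visible process."""
    procs = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        path = os.path.join(proc_root, entry)
        try:
            uid = os.stat(path).st_uid
            with open(os.path.join(path, "comm")) as f:
                name = f.read().strip()
        except OSError:
            continue  # the process exited while we were looking at it
        procs.append((int(entry), uid, name))
    return procs


def find_strays(procs, job_uid, starter_name=STARTER_NAME):
    """Return PIDs owned by job_uid, or [] if a starter is still running."""
    if any(name == starter_name for _, _, name in procs):
        return []
    return [pid for pid, uid, _ in procs if uid == job_uid]


def kill_strays(job_uid):
    """Send SIGKILL to every stray process found for job_uid."""
    for pid in find_strays(list_processes(), job_uid):
        try:
            os.kill(pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # already gone


# Do nothing unless the job UID is given explicitly on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    kill_strays(int(sys.argv[1]))
```

Run once a minute as root with the job UID as its only argument. The race described above still applies: if a new job starts before the next run, the starter check succeeds and the strays are left alone.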

Perhaps the simplest solution (using the USER_JOB_WRAPPER feature) is to have a wrapper that kills any stray processes left behind by the previous job when a new job starts. That assumes that all jobs run under the same UID, of course, or else you have to get clever with something like sudo or userv (*) or sud (**). But if all jobs run under the same UID, you might as well use dedicated user accounts as I mentioned in my previous reply.
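
A rough sketch of such a wrapper in Python follows, under several assumptions: Linux with /proc, all jobs running under the same UID as the wrapper, only one job at a time on the machine (otherwise it would kill the other jobs too), and Condor invoking the wrapper with the job executable and its arguments as the wrapper's own arguments. The wrapper spares itself and its ancestors in case they share the job UID. The helper names (read_ppid, ancestors, leftover_pids) are invented for this example.

```python
import os
import signal
import sys


def read_ppid(pid, proc_root="/proc"):
    """Return the parent PID of pid, or 0 if it cannot be read."""
    try:
        with open(os.path.join(proc_root, str(pid), "stat")) as f:
            # Format is "pid (comm) state ppid ..."; comm may contain
            # spaces, so split after the last closing parenthesis.
            return int(f.read().rsplit(")", 1)[1].split()[1])
    except (OSError, IndexError, ValueError):
        return 0


def ancestors(pid, proc_root="/proc"):
    """Return the set containing pid and all of its ancestors."""
    chain = set()
    while pid > 0 and pid not in chain:
        chain.add(pid)
        pid = read_ppid(pid, proc_root)
    return chain


def leftover_pids(uid, keep, proc_root="/proc"):
    """Return sorted PIDs owned by uid that are not in the keep set."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit() or int(entry) in keep:
            continue
        try:
            if os.stat(os.path.join(proc_root, entry)).st_uid == uid:
                found.append(int(entry))
        except OSError:
            pass  # process vanished between listdir and stat
    return sorted(found)


def main(argv):
    # Kill leftovers from the previous job, sparing ourselves and our parents.
    for pid in leftover_pids(os.getuid(), ancestors(os.getpid())):
        try:
            os.kill(pid, signal.SIGKILL)
        except OSError:
            pass
    # Replace the wrapper with the real job so Condor tracks the job itself.
    os.execv(argv[1], argv[1:])


# Only act when invoked with a job executable to run.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv)
```

Condor would be pointed at the script with a configuration line along the lines of USER_JOB_WRAPPER = /usr/local/condor/cleanup_wrapper.py, where the path is hypothetical.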

(*) http://www.chiark.greenend.org.uk/~ian/userv/
(**) http://sud.sourceforge.net/

For reference (in the 6.6 series):

	USER_JOB_WRAPPER: http://www.cs.wisc.edu/condor/manual/v6.6/3_3Configuration.html#9119

	STARTD_CRON: http://www.cs.wisc.edu/condor/manual/v6.6/3_3Configuration.html#8744

Hope that is of some use/interest,

	-- Bruce

--
Bruce Beckles,
e-Science Specialist,
University of Cambridge Computing Service.