Re: [Condor-users] condor_rm not killing subprocesses

I'm a little confused by your note of no operating system support.  I
have indicated a reliable way of finding these processes, at least on
Linux.  I now only seek some way of having Condor use this method, even
if it means wrapping the condor executables.
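
For reference, a minimal sketch of the PGID approach, assuming the job and everything it spawns share one process group that is separate from the caller's. The helper names pgid_of and kill_job_group are hypothetical and not part of Condor; how to make Condor invoke something like this is still the open question.

```shell
#!/bin/bash
# Sketch only: pgid_of and kill_job_group are hypothetical helpers,
# not Condor tools. Assumes the job runs in its own process group.

# Print the process group ID (PGID) of the given PID, whitespace stripped.
pgid_of() {
    ps -o pgid= -p "$1" | tr -d ' '
}

# Send a signal (default TERM) to every process in the group of the given PID.
kill_job_group() {
    local pid=$1 sig=${2:-TERM} pgid
    pgid=$(pgid_of "$pid")
    [ -n "$pgid" ] || return 1     # process is already gone
    # A negative target tells kill to signal the whole process group.
    kill -s "$sig" -- "-$pgid"
}
```

Calling kill_job_group with the PID of the job's top-level script signals the script and any children still in its group. If the target shares a group with the caller, the caller is signalled as well, so this only makes sense for jobs started in their own group.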


Mark Silberstein wrote:
> Unfortunately there's not too much you can do - Condor's kill mechanism is
> as simple as sending kill to the process and to all its children. Seems
> OK, but the way Condor detects the children of the process is a bit
> problematic, since there's no operating system support for this in Linux.
> So it samples the process tree periodically. If you are unlucky enough
> to issue condor_rm before Condor samples the process tree - too bad,
> you've got a runaway child.
> The only thing I think you can do is to run a cron job on all your
> machines which does this garbage collection.
> On Fri, 2005-06-03 at 14:24 -0400, Jacob Joseph wrote:
>>As I mentioned, it does work to kill off the PGID.  Since I can't
>>realistically expect all of my users to clean up whatever they might
>>spawn, I'm looking for a method on the Condor side of things that
>>guarantees all jobs started by a user will be killed.  Can anyone
>>suggest a method of modifying condor's kill behavior?
>>Mark Silberstein wrote:
>>>Let me correct my last mail - it's simply unbelievable.
>>>I checked my own answer and was totally wrong. When a bash script is
>>>killed, it leaves its children alive. There are several threads on this
>>>in Google, and I was curious enough to check. Indeed, it is claimed that
>>>there's no simple solution to this problem.
>>>So the only thing I would do is to trap EXIT in the script and kill all
>>>running processes. It does work for this simple snippet:
>>>procname=sleep
>>>clean() {
>>>	killall $procname
>>>}
>>>trap clean EXIT
>>>for i in {1..10}; do
>>>	$procname 100
>>>done
>>>If you kill this script, sleep is killed.
>>>On Fri, 2005-06-03 at 01:18 -0400, Jacob Joseph wrote:
>>>>Hi.  I have a number of users who have taken to wrapping their jobs
>>>>within shell scripts.  Often, they'll use a for or while loop to execute
>>>>a single command with various permutations.  When such a job is removed
>>>>with condor_rm, the main script is killed, but subprocesses spawned from
>>>>inside a loop will not be killed and will continue to run on the compute
>>>>machine.  This naturally interferes with jobs which are later assigned
>>>>to that machine.
>>>>Does anyone know of a way to force bash subprocesses to be killed along
>>>>with the parent upon removal with condor_rm?  (This behavior is not
>>>>unique to condor_rm.  A kill to the parent also leaves the subprocess
