
RE: [Condor-users] condor_rm not killing subprocesses



> I'm a little confused by your note of no operating system support.  I
> have indicated a reliable way of finding these processes, at least on
> Linux. I now only seek some way of having Condor use this method, even
> if it means wrapping the condor executables.

Mark is right - there really is no good OS support.  All a process has
to do is fork twice and have the intermediate process exit; the
grandchild is then inherited by init.  Condor's method of taking
snapshots of the process tree catches this... if it doesn't happen too
fast.  The problem is that it frequently does happen too fast.
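
For illustration, here is the escape in miniature as a bash sketch
(the sleep stands in for any long-running child):

  # The subshell is fork #1; the "&" inside it is fork #2; the
  # subshell then exits at once, so init adopts the sleep before any
  # periodic snapshot of the process tree can tie it to this script.
  ( sleep 3600 & )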

Mike Yoder
Principal Member of Technical Staff
Direct : +1.408.321.9000
Fax    : +1.408.904.5992
Mobile : +1.408.497.7597
yoderm@xxxxxxxxxx

Optena Corporation
2860 Zanker Road, Suite 201
San Jose, CA 95134
http://www.optena.com



> -Jacob
> 
> Mark Silberstein wrote:
> > Unfortunately there's not too much you can do - Condor's kill
> > mechanism is as simple as sending a kill to the process and to all
> > its children. Seems OK, but the way Condor detects the children of
> > the process is a bit problematic, since there's no operating system
> > support for this in Linux. So it samples the process tree
> > periodically. If you are unlucky enough to issue condor_rm before
> > Condor samples the process tree - too bad, you've got a runaway
> > child.
> > The only thing I think you can do is to run a cron job on all your
> > machines which does this garbage collection.
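> >
> > A minimal sketch of such a cleanup cron job (the dedicated job user
> > 'condoruser' and the PPID-1 heuristic are assumptions for
> > illustration, not anything Condor provides):
> >
> > #!/bin/bash
> > # Orphaned job processes are reparented to init (PPID 1); kill any
> > # process owned by the job user that has ended up there.
> > for pid in $(ps -u condoruser -o pid= -o ppid= | awk '$2 == 1 {print $1}'); do
> >     kill -TERM "$pid"
> > done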
> >
> > On Fri, 2005-06-03 at 14:24 -0400, Jacob Joseph wrote:
> >
> >>As I mentioned, it does work to kill off the PGID.  Since I can't
> >>realistically expect all of my users to clean up whatever they might
> >>spawn, I'm looking for a method on the Condor side of things that
> >>guarantees all jobs started by a user will be killed.  Can anyone
> >>suggest a method of modifying condor's kill behavior?
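> >>
> >>For reference, the process-group kill looks like this ($job_pid is
> >>just a stand-in for the top-level PID of the job script):
> >>
> >># Look up the process group of the job's top-level script, then
> >># signal the whole group; "--" plus the leading "-" make kill
> >># target the group rather than a single PID.
> >>pgid=$(ps -o pgid= -p "$job_pid" | tr -d ' ')
> >>kill -TERM -- "-$pgid"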
> >>
> >>-Jacob
> >>
> >>Mark Silberstein wrote:
> >>
> >>>Hi
> >>>Let me correct my last mail - it's simply unbelievable.
> >>>I checked my own answer and was totally wrong. When a bash script
> >>>is killed, it leaves its children alive. There are several threads
> >>>on this in Google, and I was curious enough to check. Indeed, it is
> >>>claimed that there's no simple solution to this problem.
> >>>So the only thing I would do is to trap EXIT in the script and kill
> >>>all running processes. It does work for this simple snippet:
> >>>
> >>>#!/bin/bash
> >>>procname=sleep
> >>>clean(){
> >>>    # NB: killall matches by name, so this takes out every
> >>>    # $procname process the user owns, not just our children.
> >>>    killall $procname
> >>>}
> >>># Run clean on any exit, including when the script itself is killed.
> >>>trap clean EXIT
> >>>for i in {1..10}; do
> >>>    $procname 100
> >>>done
> >>>
> >>>If you kill this script, sleep is killed.
> >>>
> >>>Mark
> >>>
> >>>On Fri, 2005-06-03 at 01:18 -0400, Jacob Joseph wrote:
> >>>
> >>>
> >>>>Hi.  I have a number of users who have taken to wrapping their
> >>>>jobs within shell scripts.  Often, they'll use a for or while loop
> >>>>to execute a single command with various permutations.  When such
> >>>>a job is removed with condor_rm, the main script is killed, but
> >>>>subprocesses spawned from inside a loop will not be killed and
> >>>>will continue to run on the compute machine.  This naturally
> >>>>interferes with jobs which are later assigned to that machine.
> >>>>
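> >>>>A typical wrapper of the sort I mean (the command and its
> >>>>parameters are made up for illustration):
> >>>>
> >>>>#!/bin/bash
> >>>># When this script is killed, the ./simulate child that is
> >>>># currently running is left behind on the machine.
> >>>>for n in 1 2 3 4 5; do
> >>>>    ./simulate --trial $n
> >>>>done
> >>>>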
> >>>>Does anyone know of a way to force bash subprocesses to be killed
> >>>>along with the parent upon removal with condor_rm?  (This behavior
> >>>>is not unique to condor_rm.  A kill to the parent also leaves the
> >>>>subprocesses running.)
> >>>>
> >>>>-Jacob
> >>>
> >>>
> >
> _______________________________________________
> Condor-users mailing list
> Condor-users@xxxxxxxxxxx
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users