
Re: [Condor-users] condor_rm not killing subprocesses



Alright. How quickly is this snapshot taken? The jobs I am seeing this with have runtimes measured in hours and days, and my test cases run indefinitely.

-Jacob

Michael Yoder wrote:
I'm a little confused by your note of no operating system support.  I
have indicated a reliable way of finding these processes, at least on
Linux. I now only seek some way of having Condor use this method, even
if it means wrapping the condor executables.


Mark is right - there really is no good OS support.  All a process has
to do is fork twice and have the intermediate process exit.  Then the
grandchild will be inherited by init.  Condor's method of taking
snapshots of the process tree catches this...if it doesn't happen too
fast.  The problem is, it frequently happens too fast.
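To make that concrete, here is a minimal shell sketch of the escape being described (a sketch only; the subshell plays the intermediate process and the long sleep stands in for the real work):

#!/bin/bash
# The subshell backgrounds sleep and exits immediately, so sleep loses
# its parent and is reparented to init (PPID 1) almost at once -
# typically before a periodic snapshot of the process tree would see it.
( sleep 600 & )

sleep 1                        # give the reparenting a moment
ps -C sleep -o pid,ppid,comm   # the long sleep now reports PPID 1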

Mike Yoder
Principal Member of Technical Staff
Direct : +1.408.321.9000
Fax    : +1.408.904.5992
Mobile : +1.408.497.7597
yoderm@xxxxxxxxxx

Optena Corporation
2860 Zanker Road, Suite 201
San Jose, CA 95134
http://www.optena.com




-Jacob

Mark Silberstein wrote:

Unfortunately there's not too much you can do - Condor's kill mechanism is as simple as sending a kill to the process and to all its children. That seems OK, but the way Condor detects the children of the process is a bit problematic, since there's no operating system support for this in Linux. So it samples the process tree periodically. If you are unlucky enough to issue condor_rm before Condor samples the process tree - too bad, you've got a runaway child.
The only thing I think you can do is to run a cron job on all your machines which does this garbage collection.
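A minimal sketch of such a cron cleanup, assuming jobs on the execute machines run as a single dedicated account (JOB_USER below is a placeholder, not a Condor setting): it signals any process owned by that account whose parent is init, i.e. a likely orphaned job child.

#!/bin/bash
# Hypothetical garbage-collection script to run from cron on each
# execute machine. Assumes all job processes run as $JOB_USER, so any
# of its processes reparented to init (PPID 1) is a leftover child.
JOB_USER=condorjob

ps -u "$JOB_USER" -o pid=,ppid= | while read -r pid ppid; do
    if [ "$ppid" -eq 1 ]; then
        kill -TERM "$pid"
    fi
done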

On Fri, 2005-06-03 at 14:24 -0400, Jacob Joseph wrote:


As I mentioned, it does work to kill off the PGID.  Since I can't
realistically expect all of my users to clean up whatever they might
spawn, I'm looking for a method on the Condor side of things that
guarantees all jobs started by a user will be killed.  Can anyone
suggest a method of modifying condor's kill behavior?
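For reference, a rough sketch of the process-group kill Jacob mentions (PID below is a placeholder for the top-level script's PID; this only reaches children that stayed in the same process group):

#!/bin/bash
# Hypothetical cleanup by process group: look up the group of the
# top-level script and signal the whole group with a negative PID.
PID=12345                                  # placeholder: top-level job PID
PGID=$(ps -o pgid= -p "$PID" | tr -d ' ')  # find its process group
kill -TERM -- "-$PGID"                     # signal every member of the group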

-Jacob

Mark Silberstein wrote:


Hi
Let me correct my last mail - it's simply unbelievable. I checked my own answer and was totally wrong. When a bash script is killed, it leaves its children alive. There are several threads about this on Google, and I was curious enough to check. Indeed, it is claimed that there's no simple solution to this problem.
So the only thing I would do is to trap EXIT in the script and kill all running processes. It does work for this simple snippet:

procname=sleep

# On exit, kill every process with this name.
clean(){
    killall $procname
}
trap clean EXIT

for i in {1..10}; do
    $procname 100
done

If you kill this script, sleep is killed.
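One caveat: killall matches by name machine-wide. A hedged variant (assuming bash) that signals only the children this particular script started would background each command and kill the shell's own jobs from the trap:

#!/bin/bash
clean(){
    kill $(jobs -p) 2>/dev/null   # jobs -p lists this shell's background children
    exit
}
trap clean TERM INT EXIT

for i in {1..10}; do
    sleep 100 &    # run the work in the background...
    wait $!        # ...so the shell can still react to the signal
done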

Mark

On Fri, 2005-06-03 at 01:18 -0400, Jacob Joseph wrote:



Hi. I have a number of users who have taken to wrapping their jobs within shell scripts. Often, they'll use a for or while loop to execute a single command with various permutations. When such a job is removed with condor_rm, the main script is killed, but subprocesses spawned from inside a loop will not be killed and will continue to run on the compute machine. This naturally interferes with jobs which are later assigned to that machine.

Does anyone know of a way to force bash subprocesses to be killed along with the parent upon removal with condor_rm? (This behavior is not unique to condor_rm. A kill to the parent also leaves the subprocess running.)
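For illustration, a minimal hypothetical wrapper of the kind described (the command name is a placeholder): sending kill to the script's PID terminates bash, but the child command that is currently running keeps going.

#!/bin/bash
# Hypothetical user wrapper: one long-running command per iteration.
for dataset in a b c; do
    some_long_running_command "$dataset"   # placeholder for the real work
done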

-Jacob




_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/condor-users