
Re: [Condor-users] Fault Behaviour of Condor



On 8/2/06, thomas.t.hoppe@xxxxxxxxxxxxxxxxxxx
<thomas.t.hoppe@xxxxxxxxxxxxxxxxxxx> wrote:

Hi,

I'm currently running a small Condor 6.7.19 pool, with GT4 GRAM as the
submit interface, for testing.
I wanted to test Condor's behaviour under several fault scenarios.
Here are my results:

1.) Killing the job on the executor machine
Outcome: Condor returned an exit code of 1

This is the desired behaviour.

2.) Shutting down the Condor daemons on the executor
Outcome: Condor restarted the job on another machine -- WOW, is this
standard behaviour for Condor?!
I have never seen that before.

This is again the desired behaviour. With the notable exception of
disks dying (see recent post), Condor is very well behaved when an
execute machine shuts itself down cleanly.

3.) Shutting down the NIC on the executor (I assume same as pulling the
plug)
Outcome: Condor hangs; a shadow process stays around the whole time.
I cannot even remove the job with condor_rm!
Maybe a bug? What can I do?

condor_rm -forcex may get rid of it. You may need to kill off the
shadow by hand, though it should eventually time out. How long did you
give it?
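
As a rough sketch of that sequence (assuming, as I understand it, that
-forcex only acts on jobs already marked for removal, so a plain
condor_rm goes first): force_remove and the CONDOR_RM override are
made-up names for illustration, and the job ID is whatever condor_q
reports on your submit machine.

```shell
# Sketch only: force_remove is a hypothetical helper, not part of Condor.
# CONDOR_RM lets you swap in another command (e.g. echo) for a dry run.
force_remove() {
    job_id="$1"                       # cluster.proc ID as shown by condor_q
    rm_cmd="${CONDOR_RM:-condor_rm}"

    # Normal removal first; this marks the job as removed (state X).
    "$rm_cmd" "$job_id"

    # Then ask for the removed job to be dropped from the queue right away.
    "$rm_cmd" -forcex "$job_id"
}
```

Calling it as "force_remove 42.0" (42.0 being a placeholder ID) runs
both steps. If a condor_shadow process for the job is still alive on the
submit machine afterwards, that one still has to be found and killed by
hand.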

Some older versions in the 6.6 series, at least on Windows, were very
poor at cleaning up dead jobs when the execute machine stopped
responding altogether.

Matt