
[Condor-users] Reply: Re: Fault Behaviour of Condor




Hi Matt,

For 1.) and 2.) the behaviour is just fine! -- I've also followed the discussion
regarding disk failure.
Maybe the documentation should state more clearly that
Condor's default behaviour is to restart a job in case of a fault
(I might have overlooked that).

Regarding 3.)
I gave it over an hour, I think.
I've updated my executors to 6.8, but the behaviour persists.
Do you think moving the central manager to 6.8 could resolve this?

thanks, Thomas




matthew.hope@xxxxxxxxx
Sent by: condor-users-bounces@xxxxxxxxxxx

02.08.2006 17:57

Please reply to
condor-users@xxxxxxxxxxx

To
condor-users@xxxxxxxxxxx
Cc
Subject
Re: [Condor-users] Fault Behaviour of Condor





On 8/2/06, thomas.t.hoppe@xxxxxxxxxxxxxxxxxxx
<thomas.t.hoppe@xxxxxxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> I'm currently running a small Condor 6.7.19 pool with GT4 GRAM as the submit
> interface for testing.
> I wanted to test Condor's behaviour in several fault scenarios.
> Here are my results:
>
> 1.) Killing the job on the executor machine
> Outcome: Condor returned an exit code of 1

This is the desired behaviour

> 2.) Shutting down the Condor daemons on the executor
> Outcome: Condor restarted the job on another machine -- WOW, is this
> standard behaviour of Condor?!
> I never saw that.

This is again the desired behaviour. With the notable exception of
disks dying (see recent post), Condor copes very well with an
execute machine shutting itself down cleanly.

> 3.) Shutting down the NIC on the executor (I assume same as pulling the
> plug)
> Outcome: Condor hangs, and a shadow process keeps running the whole time.
> I cannot even remove the job with condor_rm!
> Maybe a bug? What can I do?

condor_rm -forcex may get rid of it. You may need to kill off the
shadow by hand, although it should eventually time out. How long did
you give it?
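
For anyone scripting this, here is a minimal sketch of the forced removal in Python. The helper names and the job id 42.0 are made up for illustration; only condor_rm and its -forcex flag come from the advice above. As far as I know, -forcex only acts on jobs that are already marked for removal, so a plain condor_rm would normally come first.

```python
import shlex
import shutil
import subprocess
from typing import List


def force_remove_command(job_id: str) -> List[str]:
    """Build the condor_rm -forcex invocation for one job.

    job_id is the cluster or cluster.proc id as shown by condor_q
    (hypothetical value below).
    """
    return ["condor_rm", "-forcex", job_id]


def force_remove(job_id: str) -> bool:
    """Run the command if condor_rm is on PATH; True means exit code 0."""
    if shutil.which("condor_rm") is None:
        # No Condor tools on this machine, so nothing is executed.
        return False
    return subprocess.run(force_remove_command(job_id)).returncode == 0


# Print the command for review instead of running it straight away.
print(shlex.join(force_remove_command("42.0")))
```

Printing the command first is deliberate: a forced removal skips the normal cleanup, so it is worth checking the job id before calling force_remove.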

Some older versions in the 6.6 series, at least on Windows, were very
poor at cleaning up dead jobs when the execute machine stopped
responding altogether.

Matt