Re: [Condor-users] negotiating with schedds when a client has FW



Andrey Kaliazin <A.Kaliazin@xxxxxxxxxxx> writes:

> Schedd is fine here: it provides the string of jobs to run and just waits
> patiently while the Negotiator dispatches them. If the Start daemons respond
> properly, everything is fine. But if one of the compute nodes at the top of
> the matched list fails for whatever reason (mainly networking problems in
> our case), then the Negotiator does not just dismiss it and move on to the
> next best node, but halts the whole cycle. And a couple of minutes later,
> in the next cycle, the story repeats itself, because the faulty node is
> still at the top of the list.

This sounds like exactly the problem we frequently run into here.
Our machines are administered by various individuals, and firewalls
are often closed by accident or other problems occur, and until they
are fixed the cluster is barely usable.  Sometimes the admin is away
and I don't have the power to fix the problem or even to turn off the
machine!  It would be nice if Condor handled such situations gracefully.
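
To make "gracefully" a bit more concrete, here is a rough sketch of the
difference between halting the cycle and skipping a bad node.  This is
not Condor's negotiator code, just a toy model: negotiate(), claim() and
fake_claim() are made-up names, and an unreachable node is simulated by
raising ConnectionError.

```python
def negotiate(jobs, ranked_nodes, claim, skip_failed=True):
    """Toy matchmaking cycle (hypothetical, not Condor's implementation).

    ranked_nodes is ordered best-first.  claim(node, job) is an assumed
    callable that returns True on success and raises ConnectionError
    when the node cannot be reached (e.g. a closed firewall).
    """
    matches = {}
    unreachable = set()
    free = list(ranked_nodes)
    for job in jobs:
        for node in list(free):
            if node in unreachable:
                continue  # already known to be bad in this cycle
            try:
                accepted = claim(node, job)
            except ConnectionError:
                unreachable.add(node)
                if not skip_failed:
                    # The behaviour described above: one bad node
                    # at the top of the list ends the whole cycle.
                    return matches, unreachable
                continue  # graceful: fall through to the next best node
            if accepted:
                matches[job] = node
                free.remove(node)
                break
    return matches, unreachable


def fake_claim(node, job):
    # Simulated cluster: "bad" is firewalled, everything else accepts.
    if node == "bad":
        raise ConnectionError("cannot reach " + node)
    return True


nodes = ["bad", "n1", "n2"]
jobs = ["j1", "j2"]
halting = negotiate(jobs, nodes, fake_claim, skip_failed=False)
graceful = negotiate(jobs, nodes, fake_claim, skip_failed=True)
```

With skip_failed=False nothing gets matched because "bad" is ranked
first; with skip_failed=True both jobs land on the remaining nodes and
"bad" is simply reported as unreachable.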

Dan