Re: [Condor-users] negotiating with schedds when a client has FW
- Date: Fri, 24 Jun 2005 12:20:40 -0500
- From: Erik Paulson <epaulson@xxxxxxxxxxx>
- Subject: Re: [Condor-users] negotiating with schedds when a client has FW
On Mon, Jun 20, 2005 at 07:14:45PM +1000, Christopher Mellen wrote:
> A 'workaround' for the problem described below can be effected by setting
> NEGOTIATE_ALL_JOBS_IN_CLUSTER = True
> in section 4 of the condor_config file. Nominally this macro is FALSE. See
> the comments in the config file for why this works and some of the potential
> problems associated.
> In practice we've found that use of this setting works very well for us. FYI
> our cluster size is ~50 machines.
Just to be clear - the two "clusters" here are different things.
NEGOTIATE_ALL_JOBS_IN_CLUSTER means that for each job in a job cluster (i.e.
someone did "queue 50" in their submit file to get a cluster of 50 jobs) the
schedd should send every job in that cluster to the matchmaker. By default,
the schedd makes an optimization: it stops matchmaking for a cluster
of jobs the first time one of the jobs is rejected with "No Match Found".
Traditionally, the requirements expressions for each job in a cluster
are the same, so if one is rejected they're all likely to be rejected.
It has nothing to do with the number of computers in your cluster.
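As a sketch of the distinction (names and values here are illustrative, not
from the original thread): a submit file like

    executable = myjob
    queue 50

creates one job cluster of 50 jobs, and by default a single "No Match Found"
rejection stops matchmaking for the remaining jobs in it. Adding

    NEGOTIATE_ALL_JOBS_IN_CLUSTER = True

to the condor_config file disables that shortcut, so the schedd presents
each of the 50 jobs to the negotiator individually - useful when jobs in the
same cluster can have different requirements or when a transient rejection
shouldn't block the rest, at the cost of longer negotiation cycles.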
> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx
> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Dan Christensen
> Sent: Saturday, 18 June 2005 5:51 AM
> To: Condor-Users Mail List
> Subject: Re: [Condor-users] negotiating with schedds when a client has FW
> Andrey Kaliazin <A.Kaliazin@xxxxxxxxxxx> writes:
> > Schedd is fine here, it provides the string of jobs to run and just waits
> > patiently, while Negotiator
> > dispatches them. If Start daemons respond properly everything is fine.
> > But, if one of the compute nodes which appears on top of the matched list
> > fails for various reasons
> > (mainly networking problems in our case) then Negotiator would not just
> > dismiss it and get the next
> > best node, but halts the whole cycle.
> > And couple of minutes later, in the next cycle the story repeats itself,
> > because this faulty node is still on top of the list.
> This sounds like exactly the same problem we run into frequently here.
> Our machines are administered by various individuals, and firewalls
> are often accidentally closed or other problems happen, and until they
> are fixed the cluster is barely usable. Sometimes the admin is away
> and I don't have the power to fix the problem or even to turn off the
> machine! It would be nice if Condor handled such situations gracefully.
> Condor-users mailing list