
Re: [Condor-users] negotiating with schedds when a client has FW



On Mon, Jun 20, 2005 at 07:14:45PM +1000, Christopher Mellen wrote:
> A 'workaround' for the problem described below can be effected by setting 
> 
> NEGOTIATE_ALL_JOBS_IN_CLUSTER = True
> 
> in section 4 of the condor_config file. By default this macro is FALSE. See
> the comments in the config file for why this works and for some of the
> potential problems associated with it.
> 
> In practice we've found that this setting works very well for us. FYI,
> our cluster size is ~50 machines.
> 
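In a typical setup, the change described above amounts to something like the
lines below. The exact file to edit and the use of condor_reconfig and
condor_config_val to apply and check the change are assumptions for
illustration, not details taken from this thread.

    # In the config file read by the schedd (e.g. the local config file),
    # override the default of False:
    NEGOTIATE_ALL_JOBS_IN_CLUSTER = True

    # Then, on that machine, tell the running daemons to re-read their
    # config, and confirm the value now set in the config:
    condor_reconfig
    condor_config_val NEGOTIATE_ALL_JOBS_IN_CLUSTER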

Just to be clear - the two "clusters" here are different things.
NEGOTIATE_ALL_JOBS_IN_CLUSTER means that for each job in a job cluster (i.e.,
someone put "queue 50" in their submit file to get a cluster of 50 jobs), the
schedd should send every job in that cluster to the matchmaker. By default,
the schedd makes an optimization: it stops matchmaking for a cluster of jobs
the first time one of the jobs is rejected with "No Match Found".
Traditionally, the requirements expression for each job in a cluster is the
same, so if one job is rejected they are all likely to be rejected.

It has nothing to do with the number of computers in your cluster.
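For illustration only (the file names and requirements below are made up, not
taken from this thread): a single submit description like the following
creates one job cluster of 50 procs, all sharing the same requirements
expression, which is exactly the case the optimization is aimed at.

    # example.sub -- hypothetical submit file; one "queue 50" statement
    # produces one job cluster (say cluster 123) containing jobs
    # 123.0 through 123.49, all with the same requirements expression.
    universe     = vanilla
    executable   = my_job
    requirements = (OpSys == "LINUX")
    output       = out.$(Process)
    error        = err.$(Process)
    log          = jobs.log
    queue 50

With the default (False), if the first of those jobs comes back "No Match
Found", the schedd skips matchmaking for the rest of the cluster in that
cycle; with NEGOTIATE_ALL_JOBS_IN_CLUSTER = True it keeps sending every proc
to the matchmaker.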

-Erik

> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx
> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Dan Christensen
> Sent: Saturday, 18 June 2005 5:51 AM
> To: Condor-Users Mail List
> Subject: Re: [Condor-users] negotiating with schedds when a client has FW
> 
> Andrey Kaliazin <A.Kaliazin@xxxxxxxxxxx> writes:
> 
> > The schedd is fine here: it provides the string of jobs to run and just
> > waits patiently while the Negotiator dispatches them. If the Start daemons
> > respond properly, everything is fine. But if one of the compute nodes at
> > the top of the matched list fails for various reasons (mainly networking
> > problems in our case), the Negotiator does not just dismiss it and move on
> > to the next best node; it halts the whole cycle. And a couple of minutes
> > later, in the next cycle, the story repeats itself, because the faulty
> > node is still at the top of the list.
> 
> This sounds like exactly the same problem we run into frequently here.
> Our machines are administered by various individuals; firewalls are
> often accidentally closed or other problems arise, and until they are
> fixed the cluster is barely usable.  Sometimes the admin is away
> and I don't have the power to fix the problem or even to turn off the
> machine!  It would be nice if Condor handled such situations gracefully.
> 
> Dan
> _______________________________________________
> Condor-users mailing list
> Condor-users@xxxxxxxxxxx
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users