Re: [Condor-users] NFS errors with log file

On Thu, Jun 05, 2008 at 04:54:46PM -0500, Steven Timm wrote:
> On Thu, 5 Jun 2008, R. Kent Wenger wrote:
> > On Thu, 5 Jun 2008, Steven Timm wrote:
> >
> >> Kent--what about the case where a complicated DAG is being parsed by
> >> more than one schedd, which could in extremes even be on
> >> different machines? Is there any thought being given to this possibility
> >> when extending DAGMAN?
> >
> > Hmm, at this point we haven't really thought much about spreading a big
> > DAG across multiple schedds and/or machines.
> >
> > The biggest obstacle to DAG scaling right now is just memory use within
> > DAGMan itself -- we have some ideas on how to tackle that.
> >
> > Kent Wenger
> > Condor Team
>
> We already have a case here at Fermilab where we are spreading
> a big DAG across multiple schedds, but right now they are still
> all physically located on the same machine.  The main schedd
> runs the scheduler universe job itself, but one of four secondary
> schedds manages all the processes of the DAG.
> But this does require that those schedds all be able to write to the
> log area of the main schedd as a local disk.  Eventually we would
> like to split those schedds out to a separate virtual machine.
> This is one of those complex DAGs where overriding the
> NFS error is the wrong thing to do; we have learned this the hard way.
> Other global file systems like GFS are no better at locking
> than NFS; in fact, they are worse.  This is a major limitation
> in scaling up to large multi-thousand-stage DAGs.  Our usual limit
> right now is about 1000.


We are successfully running O(100k)-node DAGs in LIGO using the existing
7.0.1 schedd and DAGMan scalability enhancements with just a single schedd.
I am curious what limitation you are running into with your large DAGs
on a single schedd.  Are you using an older 6.8.x version?


Stuart Anderson  anderson@xxxxxxxxxxxxxxxx