
Re: [Condor-users] NFS errors with log file

On Thu, 5 Jun 2008, R. Kent Wenger wrote:

On Thu, 5 Jun 2008, Steven Timm wrote:

Kent--what about the case where a complicated DAG is being parsed by
more than one schedd, which could in extreme cases even be on
different machines? Is any thought being given to this possibility
when extending DAGMan?

Hmm, at this point we haven't really thought much about spreading a big
DAG across multiple schedds and/or machines.

The biggest obstacle to DAG scaling right now is just memory use within
DAGMan itself -- we have some ideas on how to tackle that.

Kent Wenger
Condor Team

We already have a case here at Fermilab where we are spreading
a big DAG across multiple schedds, though right now they are all
still physically located on the same machine.  The main schedd
runs the scheduler universe job itself, but one of four secondary
schedds manages all of the DAG's node jobs.
This does require that all of those schedds be able to write to the
main schedd's log area as if it were local disk.  Eventually we would
like to split those schedds out onto a separate virtual machine.
This is one of those complex DAGs where overriding the
NFS error is the wrong thing to do; we have learned that the hard way.
Other global file systems such as GFS are no better at locking
than NFS -- in fact, they are worse.  This is a major limitation
in scaling up to large multi-thousand-node DAGs.  Our usual limit
right now is about 1000 nodes.
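For anyone following along: the override being discussed is, if I remember the knob name right, the condor_config setting sketched below.  I am noting it only for completeness -- as said above, for a complex DAG like ours it is the wrong fix, because it just suppresses the warning about unreliable file locking on NFS rather than making the locking reliable.

```
# Sketch only -- check the manual for your Condor version, as the
# exact knob name may differ.  Setting this to False tells DAGMan to
# treat a node job user log on NFS as a warning instead of a fatal
# error at submit time:
DAGMAN_LOG_ON_NFS_IS_ERROR = False
```

With this set, condor_submit_dag will proceed even when the log files live on NFS, but events can still be lost or corrupted if NFS locking misbehaves, which is exactly the failure mode we hit.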


Steven C. Timm, Ph.D  (630) 840-8525
timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Assistant Group Leader.