
Re: [Condor-users] NFS errors with log file



On Thu, 5 Jun 2008, Steven Timm wrote:

We already have a case here at Fermilab where we are spreading
a big DAG across multiple schedds, but right now they are still
all physically located on the same machine.  The main schedd
runs the scheduler universe job itself, but one of four secondary
schedds manages all the processes of the DAG.
But this does require that those schedds all be able to write to the
log area of the main schedd as a local disk.  Eventually we would
like to split those schedds out to a separate virtual machine.
This is one of those complex DAGs where overriding the
NFS error is the wrong thing to do; we have learned this the hard way.
Other global file systems like GFS are no better at locking
than NFS; in fact they are worse.  This is a major limitation
in scaling up to large multi-thousand stage DAGs.  Our usual limit
right now is about 1000.

Okay, we'll have to take that type of setup into account.

We need to do some thinking about ways to possibly spread things across several machines.

Kent Wenger
Condor Team