Re: [Condor-users] NFS errors with log file

On Thu, Jun 05, 2008 at 04:54:46PM -0500, Steven Timm wrote:
> On Thu, 5 Jun 2008, R. Kent Wenger wrote:
> > On Thu, 5 Jun 2008, Steven Timm wrote:
> >
> >> Kent--what about the case where a complicated DAG is being parsed by
> >> more than one schedd, which could in extreme cases even be on
> >> different machines?  Is there any thought being given to this possibility
> >> when extending DAGMan?
> >
> > Hmm, at this point we haven't really thought much about spreading a big
> > DAG across multiple schedds and/or machines.
> >
> > The biggest obstacle to DAG scaling right now is just memory use within
> > DAGMan itself -- we have some ideas on how to tackle that.
> >
> > Kent Wenger
> > Condor Team
> We already have a case here at Fermilab where we are spreading
> a big DAG across multiple schedds, but right now they are still
> all physically located on the same machine.  The main schedd
> runs the scheduler universe job itself, but one of four secondary
> schedds manages all the processes of the DAG.
> This does require, however, that those schedds all be able to write to the
> log area of the main schedd as local disk.  Eventually we would
> like to split those schedds out to a separate virtual machine.
> This is one of those complex DAGs where overriding the
> NFS error is the wrong thing to do; we have learned this the hard way.
> Other global file systems, like GFS, are no better at locking
> than NFS; in fact, they are worse.  This is a major limitation
> in scaling up to large multi-thousand-stage DAGs.  Our usual limit
> right now is about 1000 nodes.
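
For readers following along: the "NFS error" override discussed above refers (as I understand it) to a condor_config knob that controls whether DAGMan treats node-job user log files on NFS as a fatal error. A minimal sketch is below; the exact variable name and default should be checked against the manual for your Condor version:

```
## Hypothetical illustration -- verify against your Condor version's docs.
## If True, DAGMan aborts when node job user logs live on NFS, because
## NFS file locking on the logs is unreliable; setting it to False
## suppresses the error, which (per the discussion above) is usually
## the wrong thing to do for complex DAGs.
DAGMAN_LOG_ON_NFS_IS_ERROR = True
```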


We are successfully running O(100k) node DAGs in LIGO using the existing
7.0.1 schedd and dagman scalability enhancements with just a single schedd.
I am curious what limitation you are running into with your large DAGs
on a single schedd?  Are you using an older 6.8.x version?


Stuart Anderson  anderson@xxxxxxxxxxxxxxxx