
Re: [Condor-users] DAG questions



On Fri, 18 Dec 2009, Ian Stokes-Rees wrote:

I had a few DAG-specific questions to follow up on.  I have increased my
file handle and process limits to 40k and 20k respectively.

Ian Stokes-Rees wrote:
In particular, I'm trying to create a 100k-node DAG (flat, no
dependencies) with MAXJOBS 6000, and I'm getting the error:
...
These are in 100k separate classads in 100k directories (in a 2-tier
hierarchy, groupX/nodeY, so as to avoid overloading a single
directory), with a log file in each of the node directories (100k log
files in total).

It takes about 1 hour for the DAG to be submitted.  I've bumped up
ulimits to a level which should get rid of the problem, but it isn't
clear if I need to re-submit the DAG, restart Condor, log out and back
in, or even reboot the machine for these changes to take effect.  Any
advice kindly appreciated.
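
For illustration, a flat DAG along those lines might look roughly like the
following sketch (the node names, submit file name, and DAG file name here
are hypothetical):

    # bigrun.dag
    JOB node00000 node.sub DIR group00/node00000
    JOB node00001 node.sub DIR group00/node00001
    # ... one JOB line per node, 100k in all, with no PARENT/CHILD lines

and, if the 6000-job limit is given on the command line:

    condor_submit_dag -maxjobs 6000 bigrun.dag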

One thing to think about is upgrading from a 7.0.x DAGMan to a newer
version -- 7.4.0 (or wait for 7.4.1 if you're on Windows).  You can
change the version of DAGMan independently of the version of the rest
of your Condor setup -- just slide in new condor_dagman and
condor_submit_dag binaries.  If you go to a newer DAGMan, submission of
large DAGs will be much faster.  I just ran a test, and it took 3.8
seconds to submit a 100k-node DAG with a 7.4.1 DAGMan.
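
If you try that, swapping in the newer binaries can be as simple as copying
them over the old ones (the paths below are only examples; use wherever your
installation actually keeps condor_dagman and condor_submit_dag):

    # back up the existing binaries first (paths are examples only)
    cp /opt/condor/bin/condor_dagman /opt/condor/bin/condor_dagman.old
    cp /opt/condor/bin/condor_submit_dag /opt/condor/bin/condor_submit_dag.old
    # copy in the newer ones from a 7.4.x release
    cp condor-7.4.1/bin/condor_dagman /opt/condor/bin/
    cp condor-7.4.1/bin/condor_submit_dag /opt/condor/bin/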

There are also improvements in performance for the actual job submission,
and more knobs you can tweak to optimize that.
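
For instance, settings along these lines in the Condor configuration affect
how quickly DAGMan submits jobs and how often it scans the job log files
(the values here are only illustrative; check the manual for your version
for the exact knobs and their defaults):

    DAGMAN_MAX_SUBMITS_PER_INTERVAL = 100
    DAGMAN_USER_LOG_SCAN_INTERVAL = 5
    DAGMAN_SUBMIT_DELAY = 0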

I've read and re-read some of the DAGMan documentation.  I've now set:

DAGMAN_MAX_SUBMITS_PER_INTERVAL=250
DAGMAN_LOG_ON_NFS_IS_ERROR=False

The latter is surprising, since I understand the default is "True", yet
my jobs were submitted OK (the docs for 7.0 say this should cause DAG
failure).  All my job files are on NFS.  I don't have space on local
disk for the 20+ GB this DAG will produce on each iteration.  I'm using
Condor 7.2.  I should also mention that I have DOT generation turned on
and set to UPDATE.  This may not be a good idea.  In the short term I
can move job submission to a local disk for testing, and turn off DOT
generation.
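
(For reference, DOT generation in UPDATE mode is normally turned on with a
line like the following in the DAG file; the output file name is just an
example.)

    DOT dag.dot UPDATE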

Hmm, if your node job log files were on NFS, I don't know why it worked before. One note about this -- DAGMan only cares about whether the log files are on NFS, and those shouldn't be that big. I assume most of the 20+ GB is output files, and DAGMan doesn't care where those are.

The problem with having the node job log files on NFS is that file locking doesn't work reliably on NFS, and once in a while the log reading gets goofed up in such a way that DAGMan hangs, because it can't see any events in the log files.
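
One way to keep just the node job log files off NFS is to point the log in
each node's submit description file at local disk, along these lines (the
local path here is only an example):

    # in each node's submit file
    log = /local/scratch/dagrun/nodeY.log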

My dagman.out file is huge: 200 MB.  Is there some way to reduce the
logging level?  I couldn't see any option to do this.  I seem to get one
line per DAG node every time DAGMan re-evaluates the DAG.  100k lines
every few minutes is too much.  My ideal scenario:

1. Specify the location of the DAG log, out, and err files explicitly
(rather than have them end up in the directory where condor_submit_dag
is executed).

You can force the dagman.out file to a different directory by using the
-outfile_dir <directory> command-line argument to condor_submit_dag.
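
For example (the DAG file name is just a placeholder):

    condor_submit_dag -outfile_dir /some/other/directory mydag.dag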

The log and err files from DAGMan itself are usually very small.

If you really need to change things around, you can do 'condor_submit_dag -no_submit ...', manually edit the .condor.sub file, and then do 'condor_submit whatever.condor.sub'.
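
That sequence looks roughly like this (file names are placeholders):

    condor_submit_dag -no_submit mydag.dag
    # edit mydag.dag.condor.sub by hand here
    condor_submit mydag.dag.condor.sub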

2. Limit logging to remove per-DAG-node lines.

You can control this at least somewhat using the condor_submit_dag -debug <level> command-line flag. The default level is 3; if you set it to 0 or 1 you'll drastically cut down on the output (but of course it will be a lot harder to diagnose the problem if anything goes wrong).
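
For example, to cut the output way down (the DAG file name is a placeholder):

    condor_submit_dag -debug 1 mydag.dag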

3. Rotate the log files that could grow big

Right now there's no provision for rotating the dagman.out file.

Finally, condor_submit_dag seems to be silent while it processes the
DAG.  I don't want a flood of output, but it would be nice to know
*something* is going on.  Instead it outputs nothing for 60 minutes,
then dumps the status of the DAG submission.

See the note above about upgrading to a newer version of DAGMan -- you won't get more output, but hopefully it will be fast enough that it doesn't really matter.

Thanks for advice on how to improve our use of DAGMan.

You're welcome!

Kent Wenger
Condor Team