
[Condor-users] DAG questions



I had a few DAG-specific questions to follow up on. I have increased my file handle and process limits to 40k and 20k respectively.

Ian Stokes-Rees wrote:
In particular, I'm trying to create a 100k node DAG (flat, no dependencies), with MAXJOBS 6000 and I'm getting the error:
...
These are in 100k separate classads in 100k directories (in a 2-tier hierarchy groupX/nodeY, so as to avoid overloading a single directory), with 100k log files in each of the node directories.
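For readers unfamiliar with this layout, here is a minimal sketch of how a flat DAG file over a two-tier groupX/nodeY tree might be generated. The node names (n0, n1, ...), the submit file name job.sub, and the group size of 1000 are assumptions made for illustration, not taken from the actual setup. It relies only on the DAGMan JOB line with its optional DIR keyword; the throttle would be given separately, for example with condor_submit_dag -maxjobs 6000.

```python
def flat_dag_lines(n_nodes, nodes_per_group=1000, submit_file="job.sub"):
    """Yield one DAGMan JOB line per node; no PARENT/CHILD lines, so the DAG is flat.

    nodes_per_group and submit_file are hypothetical values for illustration.
    """
    for i in range(n_nodes):
        # Split the flat index into a two-tier groupX/nodeY path.
        group, node = divmod(i, nodes_per_group)
        node_dir = f"group{group}/node{node}"
        # DIR makes DAGMan submit the node from inside its own directory.
        yield f"JOB n{i} {submit_file} DIR {node_dir}"


if __name__ == "__main__":
    # Print a small sample rather than all 100k lines.
    for line in flat_dag_lines(3):
        print(line)
```

Writing the yielded lines to a file, one per line, gives a DAG file with no dependencies between nodes.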

It takes about 1 hour for the DAG to be submitted. I've bumped up the ulimits to a level that should get rid of the problem, but it isn't clear whether I need to re-submit the DAG, restart Condor, log out and back in, or even reboot the machine for these changes to take effect. Any advice kindly appreciated.
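One general point bears on this question: resource limits are per-process and inherited from the parent, so a daemon that was already running keeps the limits it started with. The sketch below, using Python's standard resource module on a Unix-like system, reports the limits seen by the process that runs it. Running it from the shell used for submission shows what newly started processes would inherit. It says nothing about already-running Condor daemons; on Linux, those can be inspected through /proc/<pid>/limits.

```python
import resource


def current_limits():
    """Return the (soft, hard) limits for open files and processes for this process."""
    return {
        "open_files": resource.getrlimit(resource.RLIMIT_NOFILE),
        "processes": resource.getrlimit(resource.RLIMIT_NPROC),
    }


if __name__ == "__main__":
    for name, (soft, hard) in current_limits().items():
        # A value equal to resource.RLIM_INFINITY means "unlimited".
        print(f"{name}: soft={soft} hard={hard}")
```

If the numbers printed here still show the old values, the change has not reached the current login session, let alone any daemons started from it.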

I've read and re-read some of the DAGMan documentation.  I've now set:

DAGMAN_MAX_SUBMITS_PER_INTERVAL=250
DAGMAN_LOG_ON_NFS_IS_ERROR=False

The latter is surprising: I understand the default is "True", yet my jobs were submitted OK (the docs for 7.0 say this should cause the DAG to fail). All my job files are on NFS. I don't have space on local disk for the 20+ GB this DAG will produce on each iteration. I'm using Condor 7.2. I should also mention that I have DOT generation turned on and set to UPDATE, which may not be a good idea. In the short term I can move job submission to a local disk for testing and turn off DOT generation.

My dagman.out file is huge: 200 MB. Is there some way to reduce the logging level? I couldn't see any option to do this. I seem to get one line per DAG node every time DAGMan re-evaluates the DAG. 100k lines every few minutes is too much. My ideal scenario:

1. Specify the location of the DAG log, out, and err files explicitly (rather than have them end up in the directory where condor_submit_dag is executed).
2. Limit logging to remove per-DAG-node lines.
3. Rotate log files that could grow large.

Finally, condor_submit_dag seems to be silent while it processes the DAG. I don't want a flood of output, but it would be nice to know *something* is going on. Instead it outputs nothing for 60 minutes, then dumps the status of the DAG submission.

Thanks for advice on how to improve our use of DAGMan.

Ian

--
Ian Stokes-Rees, Research Associate
SBGrid, Harvard Medical School
http://sbgrid.org