
Re: [Condor-users] Too many popen() calls in DAGMan ?



On Wed, 6 Sep 2006, Masakatsu Ito wrote:

> I'm using DAGMan to perform a set of simulations
> with different parameters. DAGMan has worked well
> with a small set of simulations, but when I try
> to perform a larger set, it stops with an error
> message in its .dagman.out file, like:
>
> >9/6 00:45:28 Submitting Condor Job f1s5v13t ...
> >9/6 00:45:28 submitting: condor_submit  -a 'dag_node_name = f1s5v13t' -a '+DAGManJobID = 17168' -a 'submit_event_notes = DAG Node: f1s5v13t' -a 'currname = frame1' -a 'prevname = frame0' -a 'ndx = group.ndx' -a '+DAGParentNodeNames = "f0s5v13"' SAMPLE5/VDW13/tpbconv.submit 2>&1
> >9/6 00:45:28 condor_submit  -a 'dag_node_name = f1s5v13t' -a '+DAGManJobID = 17168' -a 'submit_event_notes = DAG Node: f1s5v13t' -a 'currname = frame1' -a 'prevname = frame0' -a 'ndx = group.ndx' -a '+DAGParentNodeNames = "f0s5v13"' SAMPLE5/VDW13/tpbconv.submit 2>&1: popen() in submit_try failed!
> >9/6 00:45:28 ERROR: submit attempt failed
> >
> >
>
> So I guess my simulations make DAGMan create
> too many processes by invoking popen().

Actually, DAGMan only has one popen() stream open at a time.  DAGMan uses
popen() to run condor_submit, and closes the stream as soon as
condor_submit returns.  And the submitting is not multi-threaded, so no
matter how many nodes your DAG has, you'll just have one popen() going at
a time.
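
As a rough sketch (this is an illustration, not DAGMan's actual source; /bin/echo stands in for condor_submit), the submit loop is strictly serial -- run one command, capture its output, close the stream, then move on to the next node:

```shell
# Illustrative sketch of DAGMan's serial submission pattern.
# Each iteration is the shell analogue of popen() + read + pclose():
# exactly one child pipe is open at a time, regardless of DAG size.
for node in nodeA nodeB nodeC; do
    output=$(/bin/echo "submitted $node")   # run, capture output, pipe closed here
    echo "$output"
done
```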

> Specifically, I'm trying to run 396 molecular dynamics
> simulations. Each simulation is divided into 20 time frames,
> so that an analysis program can be run after each time frame.
> Hence my .dag file has 16341 nodes
> ( = (simulations + analysis) * time frames + additional analysis)
> and 396 simulations and 396 analysis programs are submitted
> simultaneously to Condor, whose pool has 128 CPUs.
>
> Could anybody please tell me whether simulations of this
> size can exceed the limits of DAGMan? Or can the older version
> of DAGMan in Condor 6.7.14 easily create more processes
> than the latest version? (This older version is what's
> installed on our system.)

There's no hard limit on the number of nodes.  In 6.7.14, one thing
you'll run into with a really big DAG is that it takes a long time to
parse (this has been fixed in the most recent few versions).  The bigger
concern is overloading the Condor negotiator if you have many nodes that
are "siblings".  You can avoid that with the -maxjobs or -maxidle flags
(see the manual).
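
For example, a hypothetical invocation that caps DAGMan at 100 jobs in the queue at once (the DAG file name here is made up for illustration):

```shell
# Throttle DAGMan so at most 100 node jobs are in the Condor queue at a time
condor_submit_dag -maxjobs 100 simulations.dag
```

With -maxidle instead, DAGMan throttles on the number of idle jobs, which keeps the negotiator from being flooded while still keeping the pool busy.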

> I'm wondering if I should inspect some other aspects
> of the Condor log files, or ask our system administrator
> to update our Condor to the latest version.
>
> I'd be very grateful for any hint, advice, or comments.
>
> Thanks in advance.

As others have noted on this list, people have successfully run DAGs
bigger than yours.

I'd suggest upgrading to the latest DAGMan as a first step.  Note that
you can upgrade DAGMan without upgrading the whole Condor installation.
Just get the appropriate tarball and extract the condor_dagman and
condor_submit_dag binaries.  You can just put those binaries somewhere
that's earlier in your PATH than the main Condor bin directory, and you'll
pick them up automatically.
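
The PATH trick looks something like the following (the directory name is hypothetical, and a fake script stands in for the real condor_dagman binary you'd extract from the tarball, so the shadowing is visible):

```shell
# Put newer DAGMan binaries in a directory that precedes the Condor bin dir
mkdir -p "$HOME/dagman-6.8.0"

# Stand-in for the real condor_dagman extracted from the 6.8.0 tarball
printf '#!/bin/sh\necho dagman 6.8.0\n' > "$HOME/dagman-6.8.0/condor_dagman"
chmod +x "$HOME/dagman-6.8.0/condor_dagman"

# Prepend the directory, so it's searched before the system Condor bin dir
export PATH="$HOME/dagman-6.8.0:$PATH"

resolved=$(command -v condor_dagman)   # now resolves to the new copy
echo "$resolved"
```

Because condor_submit_dag writes the path of the condor_dagman it finds into the generated submit file, picking up the new binaries this way requires no changes to the central Condor installation.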

Try running the 6.8.0 condor_dagman and condor_submit_dag, and email again
if you still have problems.

Kent Wenger
Condor Team