
Re: [Condor-users] Too many popen() calls in DAGMan ?



Thank you for your prompt response and advice.

I have now checked how many processes our PC can have, and
whether DAGMan's use of popen() has changed
in later versions.

> So I guess my simulations make DAGMan create
> too many processes by invoking popen().

> I would think it more likely that the processes created by the shadows
> for jobs running (guessing you get a lot of the pool sometimes - lucky
> you!) are eating up some user/box process limit.
> What is your max process limit on your machine?

We use a Linux PC to manage the Condor pool, so I checked its
/proc/sys/kernel/threads-max. The maximum number of processes
is 32474.

Is this too small for my simulations, where the .dag file consists of
16341 nodes and 396*2 jobs are submitted to Condor simultaneously?
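
For reference, the check described above can be sketched in Python.
This is only a rough sketch, not a diagnosis: it assumes a Linux
submit machine, it treats each of the 396*2 concurrent jobs as costing
about one extra process there, and the function names are made up for
illustration. It also reads the per-user soft limit (RLIMIT_NPROC),
which may be lower than kernel.threads-max.

```python
import resource
from pathlib import Path


def read_kernel_limit(path="/proc/sys/kernel/threads-max"):
    """Return the system-wide thread limit, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def per_user_process_limit():
    """Return the soft RLIMIT_NPROC of the current user, or None if unlimited."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return None if soft == resource.RLIM_INFINITY else soft


def headroom(limit, expected_processes):
    """Processes left under `limit`; None when there is no known limit."""
    return None if limit is None else limit - expected_processes


if __name__ == "__main__":
    # Figure taken from the message; one process per job is an assumption.
    concurrent_jobs = 396 * 2
    limits = (
        ("kernel.threads-max", read_kernel_limit()),
        ("RLIMIT_NPROC (soft)", per_user_process_limit()),
    )
    for name, limit in limits:
        print(name, limit, headroom(limit, concurrent_jobs))
```

If the per-user figure turns out to be much smaller than 32474, that
limit rather than threads-max would be the one to watch.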

> As to the version you might want to take a look at the BugFixes in
> http://www.cs.wisc.edu/condor/manual/v6.7/8_3Development_Release.html
> to see if there is anything about DAGMan you should know

I find that version 6.7.19 of condor_dagman no longer uses
the popen() call. So, if the use of popen() is what causes
the job submissions to fail, a later version of DAGMan should
complete my simulations. But the shadow daemons may still eat up
our process/thread limit (kernel.threads-max = 32474).

hmmm, I'm a bit confused...  I'd appreciate more help or hints.

P.S. I've submitted the rescue DAG and see that the jobs
which Condor failed to submit are now running.


Matt Hope wrote:

On 9/6/06, Masakatsu Ito <m-ito@xxxxxxxxxxxxxx> wrote:

Dear all,

I'm using DAGMan to perform a set of simulations
with different parameters. DAGMan has worked well
with a small set of simulations, but when I try
to perform a larger set, it stops with an error
message in its .dagman.out file, like:

>9/6 00:45:28 Submitting Condor Job f1s5v13t ...
>9/6 00:45:28 submitting: condor_submit -a 'dag_node_name = f1s5v13t' -a '+DAGManJobID = 17168' -a 'submit_event_notes = DAG Node: f1s5v13t' -a 'currname = frame1' -a 'prevname = frame0' -a 'ndx = group.ndx' -a '+DAGParentNodeNames = "f0s5v13"' SAMPLE5/VDW13/tpbconv.submit 2>&1
>9/6 00:45:28 condor_submit -a 'dag_node_name = f1s5v13t' -a '+DAGManJobID = 17168' -a 'submit_event_notes = DAG Node: f1s5v13t' -a 'currname = frame1' -a 'prevname = frame0' -a 'ndx = group.ndx' -a '+DAGParentNodeNames = "f0s5v13"' SAMPLE5/VDW13/tpbconv.submit 2>&1: popen() in submit_try failed!
>9/6 00:45:28 ERROR: submit attempt failed

So I guess my simulations make DAGMan create
too many processes by invoking popen().


I would think it more likely that the processes created by the shadows
for jobs running (guessing you get a lot of the pool sometimes - lucky
you!) are eating up some user/box process limit.
What is your max process limit on your machine?

This is a guess though. I don't know enough about DAGMan to know if it
causes a lot of process creation internally.

Could anybody please tell me whether simulations of this size
can exceed the limits of DAGMan? Or can the older version of
DAGMan in Condor 6.7.14 easily create more processes
than the latest version? (This older version is the one
installed on our system.)


There are plenty of people using DAGMan to submit thousands of jobs
(though they tend to make sure they only have a few hundred jobs in
the queue at any one time for performance).
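
One way to get that kind of throttling is the -maxjobs option of
condor_submit_dag, which caps how many of the DAG's jobs are submitted
at once. The Python sketch below only builds and prints the command
line; the file name simulations.dag and the cap of 300 are
illustrative assumptions, not values from this thread, and the option
should be checked against the manual for the installed version.

```python
import shlex


def build_submit_command(dag_file, max_jobs):
    """Build a condor_submit_dag command that caps concurrently submitted jobs."""
    if max_jobs < 1:
        raise ValueError("max_jobs must be a positive integer")
    return ["condor_submit_dag", "-maxjobs", str(max_jobs), dag_file]


if __name__ == "__main__":
    # Hypothetical DAG file name and job cap, chosen only for illustration.
    command = build_submit_command("simulations.dag", 300)
    print(shlex.join(command))
```

A lower cap keeps fewer jobs in the queue at once, at the cost of
using less of the pool at any moment.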

As to the version you might want to take a look at the BugFixes in
http://www.cs.wisc.edu/condor/manual/v6.7/8_3Development_Release.html
to see if there is anything about DAGMan you should know.

Matt



--
Masakatsu Ito

Nanotechnology Research Center
FUJITSU LABORATORIES LTD.