
[Condor-users] dagman job failed with (Too many open files)



I'm running a private Condor pool with 18 CPU slots (5 machines: 4 quad-core and 1 dual-core). I successfully ran batches of 10, 100, and 1000 jobs at a time using DAGMan, then tried a full run of everything, which was 6115 jobs, and Condor failed at 1015 jobs. Of course I only tested up to 1000, silly me.

Hopefully this is a configuration issue that can easily be resolved. In my dagfile.dag.dagman.out I found this error:

2/21 07:15:41 Of 6115 nodes total:
2/21 07:15:41  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
2/21 07:15:41   ===     ===      ===     ===     ===        ===      ===
2/21 07:15:41   294       0      721       0    5100          0        0
2/21 07:15:46 Submitting Condor Node PROV5126_STUDISABILITYSERVICE job(s)...
2/21 07:15:46 submitting: condor_submit -a dag_node_name' '=' 'PROV5126_STUDISABILITYSERVICE -a +DAGManJobId' '=' '1293 -a DAGManJobId' '=' '1293 -a submit_event_notes' '=' 'DAG' 'Node:' 'PROV5126_STUDISABILITYSERVICE -a +DAGParentNodeNames' '=' '"" /usr/local/files/condorSubmitFiles/PROV5126_STUDISABILITYSERVICE.bbCondor
2/21 07:15:46 From submit: Submitting job(s)ERROR "Unable to open null file (/dev/null). Needed for formatting purposes. errno=24 (Too many open files)" at line 156 in file condor_snutils.c
2/21 07:15:46 failed while reading from pipe.
2/21 07:15:46 Read so far: Submitting job(s)ERROR "Unable to open null file (/dev/null). Needed for formatting purposes. errno=24 (Too many open files)" at line 156 in file condor_snutils.c
2/21 07:15:46 ERROR: submit attempt failed 
2/21 07:15:46 submit command was: condor_submit -a dag_node_name' '=' 'PROV5126_STUDISABILITYSERVICE -a +DAGManJobId' '=' '1293 -a DAGManJobId' '=' '1293 -a submit_event_notes' '=' 'DAG' 'Node:' 'PROV5126_STUDISABILITYSERVICE -a +DAGParentNodeNames' '=' '"" /usr/local/files/condorSubmitFiles/PROV5126_STUDISABILITYSERVICE.bbCondor

The remaining Queued jobs finished, but the 5100 Ready jobs never did. I started this at 6:40 AM and killed it around 2:35 PM; the last of the Queued jobs was done by 9:30 AM. condor_q showed only the DAGMan job itself (running), with no other jobs in the queue.
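
My guess (and it is only a guess) is that errno 24 means the condor_dagman and/or condor_schedd process on the submit machine ran out of per-process file descriptors; the default soft limit of 1024 on most Linux boxes is suspiciously close to the 1015 nodes that got through. When I'm back at it I plan to check the limits and the descriptor count on the submit host with something like the following (the pgrep lookup is just for illustration; I'll use whatever PID condor_q reports):

    # soft and hard open-file limits for my shell on the submit host
    # (the Condor daemons may have been started with a different limit)
    ulimit -n
    ulimit -Hn

    # rough count of descriptors the running DAGMan is holding
    lsof -p $(pgrep -x condor_dagman) | wc -l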

2/21 14:35:28 Of 6115 nodes total:
2/21 14:35:28  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
2/21 14:35:28   ===     ===      ===     ===     ===        ===      ===
2/21 14:35:28  1015       0        0       0    4575          0      525
2/21 14:35:33 Submitting Condor Node CES_0217_DEPARTMENT_WORKGROUP job(s)...
2/21 14:35:33 submitting: condor_submit -a dag_node_name' '=' 'CES_0217_DEPARTMENT_WORKGROUP -a +DAGManJobId' '=' '1293 -a DAGManJobId' '=' '1293 -a submit_event_notes' '=' 'DAG' 'Node:' 'CES_0217_DEPARTMENT_WORKGROUP -a +DAGParentNodeNames' '=' '"" /usr/local/files/condorSubmitFiles/CES_0217_DEPARTMENT_WORKGROUP.bbCondor
2/21 14:35:33 From submit: Submitting job(s)ERROR "Unable to open null file (/dev/null). Needed for formatting purposes. errno=24 (Too many open files)" at line 156 in file condor_snutils.c
2/21 14:35:33 failed while reading from pipe.
2/21 14:35:33 Read so far: Submitting job(s)ERROR "Unable to open null file (/dev/null). Needed for formatting purposes. errno=24 (Too many open files)" at line 156 in file condor_snutils.c
2/21 14:35:33 ERROR: submit attempt failed
2/21 14:35:42 Received SIGUSR1
2/21 14:35:42 Aborting DAG...
2/21 14:35:42 Writing Rescue DAG to /usr/local/CMSIntegration/files/dagFile20090221.dag.rescue001...
2/21 14:35:42 Note: 0 total job deferrals because of -MaxJobs limit (0)
2/21 14:35:42 Note: 0 total job deferrals because of -MaxIdle limit (0)
2/21 14:35:42 Note: 0 total job deferrals because of node category throttles
2/21 14:35:42 Note: 0 total PRE script deferrals because of -MaxPre limit (0)
2/21 14:35:42 Note: 0 total POST script deferrals because of -MaxPost limit (0)
2/21 14:35:42 **** condor_scheduniv_exec.1293.0 (condor_DAGMAN) pid 14419 EXITING WITH STATUS 2
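
Assuming the file-descriptor limit turns out to be the problem, I'm hoping I can pick up the remaining nodes from the rescue DAG instead of starting over, and maybe throttle DAGMan at the same time so it never has too many node jobs in flight at once -- the -MaxJobs/-MaxIdle notes above make me think condor_submit_dag's -maxjobs/-maxidle options are meant for exactly that. Something along these lines is what I have in mind (the numbers are pure guesses on my part, and I still need to confirm in the manual whether I should submit the rescue file directly or resubmit the original DAG and let DAGMan find the rescue file itself):

    condor_submit_dag -maxjobs 500 -maxidle 100 /usr/local/CMSIntegration/files/dagFile20090221.dag.rescue001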


I'll read more of the manual on Monday, but any help is greatly appreciated.
Thanks,
Sam Hoover
CSO, CCIT
Clemson University, Clemson, SC