I'm running a private Condor pool of 18 CPU slots (5 machines: 4 quad-core, 1 dual-core). I successfully ran batches of 10, 100, and 1000 jobs at a time using DAGMan, then tried a full run of everything, which was 6115 jobs, and Condor failed at 1015 jobs. Of course I only tested up to 1000, silly me. Hopefully it's a configuration issue that can easily be resolved. In my dagfile.dag.dagman.out, I found this error:

2/21 07:15:41 Of 6115 nodes total:
2/21 07:15:41   Done     Pre   Queued    Post   Ready   Un-Ready   Failed
2/21 07:15:41    ===     ===      ===     ===     ===        ===      ===
2/21 07:15:41    294       0      721       0    5100          0        0
2/21 07:15:46 Submitting Condor Node PROV5126_STUDISABILITYSERVICE job(s)...
2/21 07:15:46 submitting: condor_submit -a dag_node_name' '=' 'PROV5126_STUDISABILITYSERVICE -a +DAGManJobId' '=' '1293 -a DAGManJobId' '=' '1293 -a submit_event_notes' '=' 'DAG' 'Node:' 'PROV5126_STUDISABILITYSERVICE -a +DAGParentNodeNames' '=' '"" /usr/local/files/condorSubmitFiles/PROV5126_STUDISABILITYSERVICE.bbCondor
2/21 07:15:46 From submit: Submitting job(s)ERROR "Unable to open null file (/dev/null). Needed for formatting purposes. errno=24 (Too many open files)" at line 156 in file condor_snutils.c
2/21 07:15:46 failed while reading from pipe.
2/21 07:15:46 Read so far: Submitting job(s)ERROR "Unable to open null file (/dev/null). Needed for formatting purposes. errno=24 (Too many open files)" at line 156 in file condor_snutils.c
2/21 07:15:46 ERROR: submit attempt failed
2/21 07:15:46 submit command was: condor_submit -a dag_node_name' '=' 'PROV5126_STUDISABILITYSERVICE -a +DAGManJobId' '=' '1293 -a DAGManJobId' '=' '1293 -a submit_event_notes' '=' 'DAG' 'Node:' 'PROV5126_STUDISABILITYSERVICE -a +DAGParentNodeNames' '=' '"" /usr/local/files/condorSubmitFiles/PROV5126_STUDISABILITYSERVICE.bbCondor

The remaining queued jobs finished, but the 5100 Ready jobs never ran. I started this run at 6:40 AM and killed it around 2:35 PM; the last of the queued jobs was done by 9:30 AM, and from then on condor_q showed only the DAGMan job itself (still running), with no other jobs in the queue. When I killed it, the end of the log read:

2/21 14:35:28 Of 6115 nodes total:
2/21 14:35:28   Done     Pre   Queued    Post   Ready   Un-Ready   Failed
2/21 14:35:28    ===     ===      ===     ===     ===        ===      ===
2/21 14:35:28   1015       0        0       0    4575          0      525
2/21 14:35:33 Submitting Condor Node CES_0217_DEPARTMENT_WORKGROUP job(s)...
2/21 14:35:33 submitting: condor_submit -a dag_node_name' '=' 'CES_0217_DEPARTMENT_WORKGROUP -a +DAGManJobId' '=' '1293 -a DAGManJobId' '=' '1293 -a submit_event_notes' '=' 'DAG' 'Node:' 'CES_0217_DEPARTMENT_WORKGROUP -a +DAGParentNodeNames' '=' '"" /usr/local/files/condorSubmitFiles/CES_0217_DEPARTMENT_WORKGROUP.bbCondor
2/21 14:35:33 From submit: Submitting job(s)ERROR "Unable to open null file (/dev/null). Needed for formatting purposes. errno=24 (Too many open files)" at line 156 in file condor_snutils.c
2/21 14:35:33 failed while reading from pipe.
2/21 14:35:33 Read so far: Submitting job(s)ERROR "Unable to open null file (/dev/null). Needed for formatting purposes. errno=24 (Too many open files)" at line 156 in file condor_snutils.c
2/21 14:35:33 ERROR: submit attempt failed
2/21 14:35:42 Received SIGUSR1
2/21 14:35:42 Aborting DAG...
2/21 14:35:42 Writing Rescue DAG to /usr/local/CMSIntegration/files/dagFile20090221.dag.rescue001...
2/21 14:35:42 Note: 0 total job deferrals because of -MaxJobs limit (0)
2/21 14:35:42 Note: 0 total job deferrals because of -MaxIdle limit (0)
2/21 14:35:42 Note: 0 total job deferrals because of node category throttles
2/21 14:35:42 Note: 0 total PRE script deferrals because of -MaxPre limit (0)
2/21 14:35:42 Note: 0 total POST script deferrals because of -MaxPost limit (0)
2/21 14:35:42 **** condor_scheduniv_exec.1293.0 (condor_DAGMAN) pid 14419 EXITING WITH STATUS 2
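From the errno, my working guess is that the submit side is running out of file descriptors: DAGMan keeps the node jobs' user logs open, so after roughly a thousand submitted nodes the condor_submit it spawns apparently can't even open /dev/null anymore. Here's what I plan to check and change on the submit machine first; just a sketch assuming Linux, and the "condor" user name and the 16384 value are guesses for my setup, not anything from the manual:

# Check the soft and hard per-process open-file limits for the user
# that runs condor_dagman / condor_schedd. errno=24 means one of
# these was hit; the default soft limit is commonly 1024.
ulimit -Sn
ulimit -Hn

# Raise them persistently, e.g. via /etc/security/limits.conf, then
# restart Condor so the daemons inherit the new limit:
#   condor  soft  nofile  16384
#   condor  hard  nofile  16384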
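Independent of the limit itself, it looks like I should also throttle DAGMan so it never has thousands of node jobs in flight at once; the dagman.out notes above even mention the -MaxJobs/-MaxIdle limits, which I left at 0 (unlimited). Something like this against the rescue DAG it wrote, where the 100/50 values are just starting points I picked, not recommendations from the manual:

# Resume the 4575 unfinished (and 525 failed) nodes from the rescue DAG,
# keeping at most 100 node jobs submitted and at most 50 of them idle:
condor_submit_dag -maxjobs 100 -maxidle 50 \
    /usr/local/CMSIntegration/files/dagFile20090221.dag.rescue001

If I'm reading the docs right, the same cap can also be set pool-wide with the DAGMAN_MAX_JOBS_SUBMITTED configuration macro instead of per-run flags.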
I'll read the manual more on Monday, but all help is greatly appreciated.

Thanks,
Sam Hoover