
[HTCondor-users] failed with errno 24 (Too many open files)



Hello,

 

Last Friday I had to put on hold 5 dags on each of 2 different submit hosts (10 dags in total) in order not to overload my NFS-mounted home filesystem. Later I released the dags and the underlying jobs, which seemed to work fine, i.e. the individual jobs previously on hold completed. However, for a while now the dagman jobs have been idle, neither starting new jobs nor exiting. Looking at the dagman.out files of the dags I find repetitions of the following:

 

 

09/06/13 15:50:20 ******************************************************

09/06/13 15:50:20 ** condor_scheduniv_exec.9350903.0 (CONDOR_DAGMAN) STARTING UP

09/06/13 15:50:20 ** /usr/bin/condor_dagman

09/06/13 15:50:20 ** SubsystemInfo: name=DAGMAN type=DAGMAN(10) class=DAEMON(1)

09/06/13 15:50:20 ** Configuration: subsystem:DAGMAN local:<NONE> class:DAEMON

09/06/13 15:50:20 ** $CondorVersion: 7.8.6 Oct 24 2012 BuildID: 73238 $

09/06/13 15:50:20 ** $CondorPlatform: x86_64_deb_6.0 $

09/06/13 15:50:20 ** PID = 16768

09/06/13 15:50:20 ** Log last touched 9/6 15:45:36

09/06/13 15:50:20 ******************************************************

09/06/13 15:50:20 Using config source: /etc/condor/condor_config

09/06/13 15:50:20 Using local config sources:

09/06/13 15:50:20 /etc/default/condor|

09/06/13 15:50:20 DaemonCore: command socket at <10.20.30.3:41003>

09/06/13 15:50:20 DaemonCore: private command socket at <10.20.30.3:41003>

09/06/13 15:50:20 Setting maximum accepts per cycle 8.

09/06/13 15:50:20 DAGMAN_USE_STRICT setting: 0

09/06/13 15:50:20 DAGMAN_VERBOSITY setting: 3

09/06/13 15:50:20 DAGMAN_DEBUG_CACHE_SIZE setting: 5242880

09/06/13 15:50:20 DAGMAN_DEBUG_CACHE_ENABLE setting: False

09/06/13 15:50:20 DAGMAN_SUBMIT_DELAY setting: 0

09/06/13 15:50:20 DAGMAN_MAX_SUBMIT_ATTEMPTS setting: 6

09/06/13 15:50:20 DAGMAN_STARTUP_CYCLE_DETECT setting: False

09/06/13 15:50:20 DAGMAN_MAX_SUBMITS_PER_INTERVAL setting: 400

09/06/13 15:50:20 DAGMAN_USER_LOG_SCAN_INTERVAL setting: 5

09/06/13 15:50:20 allow_events (DAGMAN_IGNORE_DUPLICATE_JOB_EXECUTION, DAGMAN_ALLOW_EVENTS) setting: 114

09/06/13 15:50:20 DAGMAN_RETRY_SUBMIT_FIRST setting: True

09/06/13 15:50:20 DAGMAN_RETRY_NODE_FIRST setting: False

09/06/13 15:50:20 DAGMAN_MAX_JOBS_IDLE setting: 500

09/06/13 15:50:20 DAGMAN_MAX_JOBS_SUBMITTED setting: 5000

09/06/13 15:50:20 DAGMAN_MAX_PRE_SCRIPTS setting: 0

09/06/13 15:50:20 DAGMAN_MAX_POST_SCRIPTS setting: 0

09/06/13 15:50:20 DAGMAN_ALLOW_LOG_ERROR setting: False

09/06/13 15:50:20 DAGMAN_MUNGE_NODE_NAMES setting: True

09/06/13 15:50:20 DAGMAN_PROHIBIT_MULTI_JOBS setting: False

09/06/13 15:50:20 DAGMAN_SUBMIT_DEPTH_FIRST setting: True

09/06/13 15:50:20 DAGMAN_ALWAYS_RUN_POST setting: True

09/06/13 15:50:20 DAGMAN_ABORT_DUPLICATES setting: True

09/06/13 15:50:20 DAGMAN_ABORT_ON_SCARY_SUBMIT setting: True

09/06/13 15:50:20 DAGMAN_PENDING_REPORT_INTERVAL setting: 600

09/06/13 15:50:20 DAGMAN_AUTO_RESCUE setting: True

09/06/13 15:50:20 DAGMAN_MAX_RESCUE_NUM setting: 100

09/06/13 15:50:20 DAGMAN_WRITE_PARTIAL_RESCUE setting: True

09/06/13 15:50:20 DAGMAN_DEFAULT_NODE_LOG setting: null

09/06/13 15:50:20 DAGMAN_GENERATE_SUBDAG_SUBMITS setting: True

09/06/13 15:50:20 ALL_DEBUG setting:

09/06/13 15:50:20 DAGMAN_DEBUG setting:

09/06/13 15:50:20 argv[0] == "condor_scheduniv_exec.9350903.0"

09/06/13 15:50:20 argv[1] == "-Lockfile"

09/06/13 15:50:20 argv[2] == "2_fsnlopt.dag.lock"

09/06/13 15:50:20 argv[3] == "-AutoRescue"

09/06/13 15:50:20 argv[4] == "1"

09/06/13 15:50:20 argv[5] == "-DoRescueFrom"

09/06/13 15:50:20 argv[6] == "0"

09/06/13 15:50:20 argv[7] == "-Dag"

09/06/13 15:50:20 argv[8] == "2_fsnlopt.dag"

09/06/13 15:50:20 argv[9] == "-CsdVersion"

09/06/13 15:50:20 argv[10] == "$CondorVersion: 7.8.6 Oct 24 2012 BuildID: 73238 $"

09/06/13 15:50:20 argv[11] == "-Dagman"

09/06/13 15:50:20 argv[12] == "/usr/bin/condor_dagman"

09/06/13 15:50:20 Default node log file is: </home/shaltev/MDMS_200_1/sub/offsignal/chain_6/2_fsnlopt.dag.nodes.log>

09/06/13 15:50:20 DAG Lockfile will be written to 2_fsnlopt.dag.lock

09/06/13 15:50:20 DAG Input file is 2_fsnlopt.dag

09/06/13 15:50:20 Ignoring value of DAGMAN_LOG_ON_NFS_IS_ERROR.

09/06/13 15:50:20 Parsing 1 dagfiles

09/06/13 15:50:20 Parsing 2_fsnlopt.dag ...

09/06/13 15:50:20 Dag contains 2500 total jobs

09/06/13 15:50:20 Lock file 2_fsnlopt.dag.lock detected,

09/06/13 15:50:20 Duplicate DAGMan PID 16008 is no longer alive; this DAGMan should continue.

09/06/13 15:50:20 Sleeping for 12 seconds to ensure ProcessId uniqueness

09/06/13 15:50:32 Bootstrapping...

09/06/13 15:50:32 Number of pre-completed nodes: 0

09/06/13 15:50:32 Running in RECOVERY mode... >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

09/06/13 15:50:34 MultiLogFiles::readFileToString: safe_fopen_wrapper_follow(/home/shaltev/MDMS_200_1/output/MDMS_200_1_chain_6_offsignal_508_cand/dag/fsnlopt2_H1L1.submit) failed with errno 24 (Too many open files)

09/06/13 15:50:34 MultiLogFiles: Unable to read file: /home/shaltev/MDMS_200_1/output/MDMS_200_1_chain_6_offsignal_508_cand/dag/fsnlopt2_H1L1.submit

09/06/13 15:50:34 Unable to get log file from submit file /home/shaltev/MDMS_200_1/output/MDMS_200_1_chain_6_offsignal_508_cand/dag/fsnlopt2_H1L1.submit (node fsnlopt2_508); using default (/home/shaltev/MDMS_200_1/sub/offsignal/chain_6/2_fsnlopt.dag.nodes.log)

09/06/13 15:50:34 DAGMan::Job:8001:ERROR: Unable to monitor log file for node fsnlopt2_508|ReadMultipleUserLogs:9004:Error getting file ID in monitorLogFile()|ReadMultipleUserLogs:9004:Error initializing log file /home/shaltev/MDMS_200_1/sub/offsignal/chain_6/2_fsnlopt.dag.nodes.log|MultiLogFiles:9001:Error (24, Too many open files) opening file /home/shaltev/MDMS_200_1/sub/offsignal/chain_6/2_fsnlopt.dag.nodes.log for creation or truncation

09/06/13 15:50:34 ERROR "Fatal log file monitoring error!

" at line 858 in file /slots/01/dir_16105/userdir/src/condor_dagman/job.cpp

 

 

However, I can open files on the submit host, i.e., I am able to read

 

/home/shaltev/MDMS_200_1/output/MDMS_200_1_chain_6_offsignal_508_cand/dag/fsnlopt2_H1L1.submit
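
If I read the message correctly, errno 24 is a per-process limit, i.e. the condor_dagman process itself has run out of file descriptors, so the fact that I can open the submit file interactively probably does not tell us much. As a rough check (just my own sketch using the Linux /proc interface, not a condor tool; the PID would be that of a still-running dagman, e.g. the 16768 printed in the log above) one could compare the descriptors the process holds against its limit:

#!/usr/bin/env python
# Sketch only (my assumption, not a condor tool): count the file
# descriptors a process holds and read its "Max open files" soft limit
# from the Linux /proc interface.
import os
import sys

def fd_usage(pid):
    # one symlink per open descriptor
    open_fds = len(os.listdir('/proc/%d/fd' % pid))
    soft_limit = None
    with open('/proc/%d/limits' % pid) as f:
        for line in f:
            if line.startswith('Max open files'):
                # columns: name, soft limit, hard limit, units
                soft_limit = int(line.split()[3])
    return open_fds, soft_limit

if __name__ == '__main__':
    # e.g. the dagman PID 16768 from the log above, if it is still alive
    pid = int(sys.argv[1])
    used, limit = fd_usage(pid)
    print('PID %d: %d of %d descriptors in use' % (pid, used, limit))

With 2500 nodes per dag I could imagine a default soft limit of 1024 descriptors being too low, but that is only a guess.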

 

The error in the log is the same for all 10 dags, except for the file that cannot be read:

 

/home/shaltev/MDMS_200_X/output/MDMS_200_X_chain_Y_offsignal_508_cand/dag/fsnlopt2_H1L1.submit

 

where X \in {0,1} and Y \in {6,...,10}.

 

I do not want to remove and resubmit the dags yet, as I do not understand what is going on. I am afraid that removing the dags now would mean restarting them from the very beginning, which is something I am trying to avoid.

 

Any ideas how to proceed further?

 

Thanks,

miroslav

 

 

--

Miroslav Shaltev

Albert Einstein Institute

Callinstr 38

D-30167 Hannover, Germany

 

Phone: +49-(0)511-762-3437 (room 035)