
Re: [HTCondor-users] dags and max open files



Ouch... this system just hit the new limit of 32k file descriptors, so something is definitely not right. Judging from the crash dump email below, I wonder whether there is an FD leak in the schedd when DAGMan tries to write to a read-only filesystem. Note that errno=30 is usually EROFS (Read-only file system), and errno=24 is EMFILE (Too many open files).
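
For anyone who wants to double-check the errno numbers that appear in the log, here is a minimal sketch using only Python's standard errno and os modules. The helper name describe_errno is made up for illustration, and the numeric values are platform specific, so this assumes it is run on a Linux host like the schedd machine.

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return 'NAME: message' for a numeric errno on the current platform."""
    # errno.errorcode maps numbers to symbolic names such as 'EROFS'.
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


# The two values seen in the SchedLog excerpt.
for code in (24, 30):
    print(code, describe_errno(code))
```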

Thanks.

> This is an automated email from the Condor system
> on machine "ldas-grid.ligo-la.caltech.edu".  Do not reply.
> 
> "/usr/sbin/condor_schedd" on "ldas-grid.ligo-la.caltech.edu" exited with status 44.
> Condor will automatically restart this process in 10 seconds.
> 
> *** Last 200 line(s) of file /var/log/condor/SchedLog:
...
> 08/12/22 19:05:30 (pid:3194561) Open of run/multidag.dag.lib.out failed, errno 30
> 08/12/22 19:05:30 (pid:3194561) Open of run/multidag.dag.lib.err failed, errno 30
> 08/12/22 19:05:30 (pid:3194561) Open of run/multidag.dag.lib.out failed, errno 30
> 08/12/22 19:05:30 (pid:3194561) Open of run/multidag.dag.lib.err failed, errno 30
> 08/12/22 19:05:30 (pid:3194561) Open of lalinference_1000000000-1000437000.dag.lib.out failed, errno 30
> 08/12/22 19:05:30 (pid:3194561) Open of lalinference_1000000000-1000437000.dag.lib.err failed, errno 30
> 08/12/22 19:05:30 (pid:3194561) Open of /local/condor/spool/local_univ_execute/dir_82763974_0/.job.ad failed (Too many open files, errno=24).
> 08/12/22 19:05:30 (pid:3194561) Failed to read job environment: ERROR: Missing '=' after environment variable '-Wl,--sort-common'.
> 08/12/22 19:05:30 (pid:3194561) Create_Pipe(): call to pipe() failed
> 08/12/22 19:05:30 (pid:3194561) ERROR: Can't create DC pipe for writing job ClassAd to the shadow, aborting
> 08/12/22 19:05:30 (pid:3194561) Create_Pipe(): call to pipe() failed
> 08/12/22 19:05:30 (pid:3194561) ERROR: Can't create DC pipe for writing job ClassAd to the shadow, aborting
> 08/12/22 19:05:30 (pid:3194561) Create_Pipe(): call to pipe() failed
> 08/12/22 19:05:30 (pid:3194561) ERROR: Can't create DC pipe for writing job ClassAd to the shadow, aborting
> 08/12/22 19:05:30 (pid:3194561) Calling HandleReq <HandleChildAliveCommand> (0) for command 60008 (DC_CHILDALIVE) from condor@child <10.13.5.31:11654>
> 08/12/22 19:05:30 (pid:3194561) Return from HandleReq <HandleChildAliveCommand> (handler: 0.000007s, sec: 0.000s, payload: 0.000s)
> **** PANIC -- OUT OF FILE DESCRIPTORS at line 227 in /var/lib/condor/execute/slot1/dir_89584/userdir/.tmp8t7PtB/BUILD/condor-9.0.15/src/condor_io/reli_sock.cpp
> *** End of file SchedLog
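
One way to test the leak theory would be to tally what the schedd's descriptors point at and watch whether the .out/.err counts keep growing. The following is only a rough sketch, assuming Linux /proc and enough privilege to read another user's fd directory; the function names (classify, fd_targets, tally) are hypothetical helpers, not part of HTCondor.

```python
import os
from collections import Counter

DELETED = " (deleted)"


def classify(target: str) -> str:
    """Bucket one /proc/<pid>/fd link target into a coarse category."""
    # The kernel appends ' (deleted)' to targets whose file was unlinked.
    if target.endswith(DELETED):
        target = target[: -len(DELETED)]
    if target == "/dev/null":
        return "/dev/null"
    if target.startswith("socket:"):
        return "socket"
    if target.startswith("pipe:"):
        return "pipe"
    ext = os.path.splitext(target)[1]
    return ext if ext in (".out", ".err") else "other"


def fd_targets(pid) -> list:
    """Return the link targets of every open descriptor of a process."""
    fd_dir = f"/proc/{pid}/fd"
    targets = []
    for name in os.listdir(fd_dir):
        try:
            targets.append(os.readlink(os.path.join(fd_dir, name)))
        except OSError:
            continue  # descriptor was closed between listdir and readlink
    return targets


def tally(targets) -> Counter:
    """Count descriptors per category."""
    return Counter(classify(t) for t in targets)
```

Calling tally(fd_targets(pid)) with the condor_schedd pid a few minutes apart should show whether the .out and .err buckets only ever grow while the DAGs are failing with errno 30.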


> On Aug 12, 2022, at 10:34 AM, Michael Thomas <wart@xxxxxxxxxxx> wrote:
> 
> We recently upgraded to condor 9.0.15 (which may or may not be relevant) and are now seeing some schedds reporting "too many open files", for example:
> 
> 08/12/22 10:43:02 (pid:4627) Daemon::startCommand(INVALIDATE_SUBMITTOR_ADS,...) making connection to <10.13.5.25:9618?alias=ldas-condor.ldas.ligo-la.caltech.edu>
> 08/12/22 10:43:02 (pid:4627) Can't open directory "/etc/condor/passwords.d" as PRIV_ROOT, errno: 24 (Too many open files)
> 08/12/22 10:43:02 (pid:4627) Can't open directory "/etc/condor/passwords.d" as PRIV_ROOT, errno: 24 (Too many open files)
> 08/12/22 10:43:02 (pid:4627) Can't open directory "/etc/condor/tokens.d" as PRIV_ROOT, errno: 24 (Too many open files)
> 08/12/22 10:43:02 (pid:4627) getTokenSigningKey(): read_secure_file(/etc/condor/condor_cred) failed!
> 08/12/22 10:43:02 (pid:4627) TOKEN: No token found.
> 08/12/22 10:43:02 (pid:4627) SECMAN: required authentication with collector ldas-condori failed, so aborting command INVALIDATE_SUBMITTOR_ADS.
> 
> I'm able to work around this by increasing the file descriptor limit on the schedd from the default of 4096 with:
> 
> SCHEDD_MAX_FILE_DESCRIPTORS = 32768
> 
> Looking in /proc/$pid/fd for the condor_schedd process, I see almost all open files are related to user .out, .err, and /dev/null fds from user dagman jobs.
> 
> Is it to be expected that there would be a lot of open files from dagman jobs?
> 
> --Mike

--
Stuart Anderson
sba@xxxxxxxxxxx