
Re: [HTCondor-users] dags and max open files



Another thing I notice is that one of the offending DAG jobs uses a semicolon as the separator in its environment attributes:

Env = "LESSOPEN=||/usr/bin/lesspipe.sh %s;OBJDUMP=/home/redacted.user/mambaforge/envs/igwn-py39/bin/x86_64-conda-linux-gnu-objdump;CONDA...
EnvDelim = ";"
Environment = "HTTP_PROXY=http://squid2:3128 HTTPS_PROXY=https://squid2:3128 LESSOPEN=||/usr/bin/lesspipe.sh %s;OBJDUMP=/home/redacted.user/mambaforge/envs/igwn-py39/bin/x86_64-conda-linux-gnu-objdump;CONDA...

...but other jobs that I see in the queue use a space as a separator for the environment variables.

I'm wondering if my approach of adding environment settings to a user job is the right one. Is there a preferred way to augment the environment for all user jobs? And would it be sufficient to replace the space in the following transform with $(EnvDelim)?

JOB_TRANSFORM_Proxy @=end
SET Environment "HTTP_PROXY=http://squid2:3128 HTTPS_PROXY=https://squid2:3128 $(My.Env)"
@end
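As a sanity check on the delimiter question, here is a minimal Python sketch (the variable values are illustrative, not taken from a real job ad, and HTCondor's own environment parser is of course more involved) showing how splitting a semicolon-delimited Env string on spaces produces tokens with no '=' in them, the same shape as the "Missing '='" error in the SchedLog quoted below:

```python
# Illustrative only: a semicolon-delimited Env string of the kind shown above.
env = "LESSOPEN=||/usr/bin/lesspipe.sh %s;LDFLAGS=-O2 -Wl,--sort-common"

# Splitting on the job's actual delimiter (';') keeps each NAME=VALUE pair intact.
ok = env.split(";")
print(ok)          # every token still contains '='

# Splitting on a space instead leaves bare fragments like '-Wl,--sort-common'
# with no '=', which is exactly the shape of the SchedLog error
# "Missing '=' after environment variable '-Wl,--sort-common'".
bad = env.split(" ")
broken = [tok for tok in bad if "=" not in tok]
print(broken)      # ['-Wl,--sort-common']
```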

--Mike

On 8/12/22 19:11, Anderson, Stuart B. wrote:
Ouch... this system just hit the new limit of 32k file descriptors, so something is definitely not right. From the following crash-dump email I wonder if there is an FD leak in the schedd when DAGMan tries to write to a read-only filesystem. Note that errno=30 is usually EROFS (Read-only file system).

Thanks.

This is an automated email from the Condor system
on machine "ldas-grid.ligo-la.caltech.edu".  Do not reply.

"/usr/sbin/condor_schedd" on "ldas-grid.ligo-la.caltech.edu" exited with status 44.
Condor will automatically restart this process in 10 seconds.

*** Last 200 line(s) of file /var/log/condor/SchedLog:
...
08/12/22 19:05:30 (pid:3194561) Open of run/multidag.dag.lib.out failed, errno 30
08/12/22 19:05:30 (pid:3194561) Open of run/multidag.dag.lib.err failed, errno 30
08/12/22 19:05:30 (pid:3194561) Open of run/multidag.dag.lib.out failed, errno 30
08/12/22 19:05:30 (pid:3194561) Open of run/multidag.dag.lib.err failed, errno 30
08/12/22 19:05:30 (pid:3194561) Open of lalinference_1000000000-1000437000.dag.lib.out failed, errno 30
08/12/22 19:05:30 (pid:3194561) Open of lalinference_1000000000-1000437000.dag.lib.err failed, errno 30
08/12/22 19:05:30 (pid:3194561) Open of /local/condor/spool/local_univ_execute/dir_82763974_0/.job.ad failed (Too many open files, errno=24).
08/12/22 19:05:30 (pid:3194561) Failed to read job environment: ERROR: Missing '=' after environment variable '-Wl,--sort-common'.
08/12/22 19:05:30 (pid:3194561) Create_Pipe(): call to pipe() failed
08/12/22 19:05:30 (pid:3194561) ERROR: Can't create DC pipe for writing job ClassAd to the shadow, aborting
08/12/22 19:05:30 (pid:3194561) Create_Pipe(): call to pipe() failed
08/12/22 19:05:30 (pid:3194561) ERROR: Can't create DC pipe for writing job ClassAd to the shadow, aborting
08/12/22 19:05:30 (pid:3194561) Create_Pipe(): call to pipe() failed
08/12/22 19:05:30 (pid:3194561) ERROR: Can't create DC pipe for writing job ClassAd to the shadow, aborting
08/12/22 19:05:30 (pid:3194561) Calling HandleReq <HandleChildAliveCommand> (0) for command 60008 (DC_CHILDALIVE) from condor@child <10.13.5.31:11654>
08/12/22 19:05:30 (pid:3194561) Return from HandleReq <HandleChildAliveCommand> (handler: 0.000007s, sec: 0.000s, payload: 0.000s)
**** PANIC -- OUT OF FILE DESCRIPTORS at line 227 in /var/lib/condor/execute/slot1/dir_89584/userdir/.tmp8t7PtB/BUILD/condor-9.0.15/src/condor_io/reli_sock.cpp
*** End of file SchedLog


On Aug 12, 2022, at 10:34 AM, Michael Thomas <wart@xxxxxxxxxxx> wrote:

We recently upgraded to condor 9.0.15 (which may or may not be relevant) and are now seeing some schedds reporting "too many open files", for example:

08/12/22 10:43:02 (pid:4627) Daemon::startCommand(INVALIDATE_SUBMITTOR_ADS,...) making connection to <10.13.5.25:9618?alias=ldas-condor.ldas.ligo-la.caltech.edu>
08/12/22 10:43:02 (pid:4627) Can't open directory "/etc/condor/passwords.d" as PRIV_ROOT, errno: 24 (Too many open files)
08/12/22 10:43:02 (pid:4627) Can't open directory "/etc/condor/passwords.d" as PRIV_ROOT, errno: 24 (Too many open files)
08/12/22 10:43:02 (pid:4627) Can't open directory "/etc/condor/tokens.d" as PRIV_ROOT, errno: 24 (Too many open files)
08/12/22 10:43:02 (pid:4627) getTokenSigningKey(): read_secure_file(/etc/condor/condor_cred) failed!
08/12/22 10:43:02 (pid:4627) TOKEN: No token found.
08/12/22 10:43:02 (pid:4627) SECMAN: required authentication with collector ldas-condori failed, so aborting command INVALIDATE_SUBMITTOR_ADS.

I'm able to work around this by increasing the file descriptor limit on the schedd from the default of 4096 with:

SCHEDD_MAX_FILE_DESCRIPTORS = 32768

Looking in /proc/$pid/fd for the condor_schedd process, I see that almost all of the open files are .out, .err, and /dev/null file descriptors from user DAGMan jobs.
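In case it helps anyone reproduce this, here is a small Linux-only sketch of the kind of tally I did (the fd_summary name and the .out/.err bucketing are mine, and it assumes read access to /proc/<pid>/fd for the target process):

```python
import os
from collections import Counter

def fd_summary(pid):
    """Tally a process's open file descriptors by link target.

    Each entry in /proc/<pid>/fd (Linux-specific) is a symlink to the
    open file, pipe, or socket, so readlink tells us what it points at.
    """
    fd_dir = f"/proc/{pid}/fd"
    counts = Counter()
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # fd was closed between listdir() and readlink()
        if target.endswith((".out", ".err")):
            counts["dag .out/.err"] += 1
        elif target == "/dev/null":
            counts["/dev/null"] += 1
        else:
            counts["other"] += 1
    return counts

# Example: inspect this script's own process; for the schedd, pass its pid.
print(fd_summary(os.getpid()))
```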

Is it expected that DAGMan jobs would hold this many files open?

--Mike
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/

--
Stuart Anderson
sba@xxxxxxxxxxx



