[HTCondor-users] HTCondor-CE: after 2 minutes Staging of job files failed

Dear all,

We are running HTCondor-CE 3.4.3 in our HTCondor 8.8.13 cluster.

One experiment, LHCb, submits its pilots to the HTCondor-CE in batches of 100 jobs. Since yesterday afternoon, we see that the 100 jobs are submitted to the CE and put on hold (waiting to spool input data), and then the system starts to transfer data to the CE. Everything looks fine, but the transfer always fails after 2 minutes:

06/19/21 07:10:24 (cid:206664) Submitting new job 12437062.0
06/19/21 07:10:24 (cid:206664) Submitting new job 12437062.1
06/19/21 07:12:24 (cid:206668) generalJobFilesWorkerThread(): failed to transfer files for job 12437062.66
06/19/21 07:12:24 condor_write(): Socket closed when trying to write 29 bytes to <>, fd is 19
06/19/21 07:12:24 Buf::write(): condor_write() failed
06/19/21 07:12:24 Scheduler::spoolJobFilesWorkerThread(void *arg, Stream* s) NAP TIME
06/19/21 07:12:25 ERROR - Staging of job files failed!

For example, in this case we can see that everything fails when we reach job 66 of the batch. All 100 jobs are removed after the "Staging of job files failed" error.
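
The 2-minute interval can be checked directly from the log timestamps. The following is a minimal Python sketch, not part of HTCondor: the helper names `parse_log_time` and `elapsed_seconds` are hypothetical, and it assumes every log line starts with a 17-character `MM/DD/YY HH:MM:SS` timestamp, as in the excerpt above.

```python
from datetime import datetime

# Timestamp layout at the start of each log line (MM/DD/YY HH:MM:SS).
LOG_TIME_FORMAT = "%m/%d/%y %H:%M:%S"
TIMESTAMP_LENGTH = 17  # len("06/19/21 07:10:24")


def parse_log_time(line: str) -> datetime:
    """Parse the leading timestamp of a log line (hypothetical helper)."""
    return datetime.strptime(line[:TIMESTAMP_LENGTH], LOG_TIME_FORMAT)


def elapsed_seconds(first_line: str, last_line: str) -> float:
    """Return the number of seconds between two log lines."""
    delta = parse_log_time(last_line) - parse_log_time(first_line)
    return delta.total_seconds()


# Two lines copied from the log excerpt above.
submit = "06/19/21 07:10:24 (cid:206664) Submitting new job 12437062.0"
failure = (
    "06/19/21 07:12:24 (cid:206668) generalJobFilesWorkerThread(): "
    "failed to transfer files for job 12437062.66"
)

print(elapsed_seconds(submit, failure))  # 120.0
```

For these two lines the difference is exactly 120 seconds, which is why a timeout with a 120-second default was the first thing we suspected.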

We have increased SEC_TCP_SESSION_DEADLINE from 120 to 300 seconds, as it was one of the first variables we thought might be related to this 2-minute expiration.

[root@ce14 ~]# condor_ce_config_val -dump | grep -i SEC_TCP

This issue happened in the past when they were submitting batches of 300 jobs, and after reducing the batch size to 100 everything was fine. I asked on the users list at the time, and the suggestion was that the issue was related to the other side, the LHCb machine that submits the jobs, but I'm not sure, since the issue seems to affect only us.

Any ideas? We have also looked at our TCP tuning settings, but nothing seems clearly related to this 2-minute mystery.

We have 2 HTCondor-CEs and we see the same behaviour on both. All the other experiments (ATLAS, CMS, DUNE, LIGO, etc.) are submitting and starting jobs without any issue.

Thank you in advance.



