[HTCondor-users] HTCondor-CE: after 2 minutes Staging of job files failed



Dear all,

We are running HTCondor-CE 3.4.3 in our HTCondor 8.8.13 cluster.

One experiment, LHCb, submits its pilots to the HTCondor-CE in batches of 100 jobs. Since yesterday afternoon, we have been seeing the following: the 100 jobs are submitted to the CE and put on hold (waiting to spool input data), the system starts to transfer data to the CE, and everything looks fine, but the transfer always fails after 2 minutes:

06/19/21 07:10:24 (cid:206664) Submitting new job 12437062.0
06/19/21 07:10:24 (cid:206664) Submitting new job 12437062.1
[...]
[...]
06/19/21 07:12:24 (cid:206668) generalJobFilesWorkerThread(): failed to transfer files for job 12437062.66
06/19/21 07:12:24 condor_write(): Socket closed when trying to write 29 bytes to <188.185.73.26:46488>, fd is 19
06/19/21 07:12:24 Buf::write(): condor_write() failed
06/19/21 07:12:24 Scheduler::spoolJobFilesWorkerThread(void *arg, Stream* s) NAP TIME
06/19/21 07:12:25 ERROR - Staging of job files failed!
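
The interval between the first submission and the first transfer failure can be checked directly from the timestamps. This is only a minimal sketch, assuming the two log lines quoted above; seconds_between is a hypothetical helper written for illustration, not part of HTCondor:

```python
from datetime import datetime

# Timestamp layout of the log lines quoted above (MM/DD/YY HH:MM:SS).
LOG_TIME_FORMAT = "%m/%d/%y %H:%M:%S"


def seconds_between(first_line: str, last_line: str) -> float:
    """Elapsed seconds between two log lines (hypothetical helper)."""

    def parse(line: str) -> datetime:
        # The timestamp occupies the first 17 characters of each line.
        return datetime.strptime(line[:17], LOG_TIME_FORMAT)

    return (parse(last_line) - parse(first_line)).total_seconds()


submit = "06/19/21 07:10:24 (cid:206664) Submitting new job 12437062.0"
failure = (
    "06/19/21 07:12:24 (cid:206668) generalJobFilesWorkerThread(): "
    "failed to transfer files for job 12437062.66"
)

elapsed = seconds_between(submit, failure)
print(elapsed)  # 120.0
```

For this batch, the failure arrives exactly 120 seconds after the first job is submitted.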

For example, in this case everything fails when we reach job 66 of the batch, and all 100 jobs are removed after the "Staging of job files failed" error.

We have increased SEC_TCP_SESSION_DEADLINE from 120 seconds to 300, as it was one of the first variables we thought might be related to this 2-minute expiration.

[root@ce14 ~]# condor_ce_config_val -dump | grep -i SEC_TCP
SEC_TCP_SESSION_DEADLINE = 300
SEC_TCP_SESSION_TIMEOUT = 20

This issue happened in the past when they were submitting batches of 300 jobs, and after reducing the batch size to 100 everything was fine. I asked on the users list at the time, and the suggestion was that the issue was on the other side, the LHCb machine that submits the jobs, but I'm not sure, since the issue seems to affect only us.

Any ideas? We have also looked at our TCP tuning settings, but nothing seems clearly related to this 2-minute mystery.

We have 2 HTCondor-CEs and we see the same behaviour on both. All the other experiments (ATLAS, CMS, DUNE, LIGO, etc.) are submitting and starting jobs without any issue.

Thank you in advance.

Cheers,

Carles


--
Carles Acosta i Silva
PIC (Port d'Informació Científica)
Campus UAB, Edifici D
E-08193 Bellaterra, Barcelona
Tel: +34 93 581 33 08
Fax: +34 93 581 41 10
Avís - Aviso - Legal Notice: http://legal.ifae.es