[HTCondor-users] HTCondor-CE: after 2 minutes Staging of job files failed

Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

Dear all,

We are running HTCondor-CE 3.4.3 in our HTCondor 8.8.13 cluster.

One experiment, LHCb, is submitting its pilots to the HTCondor-CE in batches of 100 jobs. Since yesterday afternoon, we have been seeing the following: the 100 jobs are submitted to the CE, they go on hold (waiting to spool input data), and then the system starts to transfer data to the CE. Everything is OK up to that point, but the transfer always fails after 2 minutes:

06/19/21 07:10:24 (cid:206664) Submitting new job 12437062.0

06/19/21 07:10:24 (cid:206664) Submitting new job 12437062.1

[...]

06/19/21 07:12:24 (cid:206668) generalJobFilesWorkerThread(): failed to transfer files for job 12437062.66
06/19/21 07:12:24 condor_write(): Socket closed when trying to write 29 bytes to <188.185.73.26:46488>, fd is 19
06/19/21 07:12:24 Buf::write(): condor_write() failed
06/19/21 07:12:24 Scheduler::spoolJobFilesWorkerThread(void *arg, Stream* s) NAP TIME
06/19/21 07:12:25 ERROR - Staging of job files failed!

For example, in this case we can see that everything failed when the transfer reached job 66 of the batch; all 100 jobs are removed after the "Staging of job files failed" error.
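As a quick sanity check on the 2-minute figure, the timestamps in the excerpt above can be compared directly. This is only a minimal Python sketch, assuming every log line begins with an `MM/DD/YY HH:MM:SS` timestamp as shown; the function names are illustrative and not part of HTCondor.

```python
from datetime import datetime

# Lines copied from the log excerpt above (intermediate lines omitted).
LOG_EXCERPT = """\
06/19/21 07:10:24 (cid:206664) Submitting new job 12437062.0
06/19/21 07:10:24 (cid:206664) Submitting new job 12437062.1
06/19/21 07:12:24 (cid:206668) generalJobFilesWorkerThread(): failed to transfer files for job 12437062.66
06/19/21 07:12:25 ERROR - Staging of job files failed!
"""


def parse_timestamp(line):
    """Parse the leading 17-character 'MM/DD/YY HH:MM:SS' timestamp."""
    return datetime.strptime(line[:17], "%m/%d/%y %H:%M:%S")


def seconds_until_failure(log_text):
    """Seconds between the first submission and the first transfer failure."""
    first_submit = None
    for line in log_text.splitlines():
        if first_submit is None and "Submitting new job" in line:
            first_submit = parse_timestamp(line)
        elif first_submit is not None and "failed to transfer files" in line:
            return (parse_timestamp(line) - first_submit).total_seconds()
    return None  # no submit/failure pair found


print(seconds_until_failure(LOG_EXCERPT))
```

For this excerpt the gap between the first submission and the transfer failure is exactly 120 seconds, which is why a fixed timeout looks more likely than a random network problem.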

We have increased SEC_TCP_SESSION_DEADLINE from 120 seconds to 300, as it was one of the first variables we thought might be related to this 2-minute expiration, but the behaviour is the same.

[root@ce14 ~]# condor_ce_config_val -dump | grep -i SEC_TCP

SEC_TCP_SESSION_DEADLINE = 300
SEC_TCP_SESSION_TIMEOUT = 20
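One way to look for other candidate settings is to scan the full configuration dump for any knob whose value is exactly 120. The following is a rough Python sketch, not an HTCondor tool: the helper name `knobs_with_value` is hypothetical, and it assumes the dump consists of plain `NAME = value` lines like the two shown above.

```python
import re

# Matches simple "NAME = <integer>" lines; anything else is ignored.
KNOB_LINE = re.compile(r"^\s*([A-Za-z0-9_.]+)\s*=\s*(\d+)\s*$")


def knobs_with_value(dump_text, target):
    """Return the names of all knobs whose integer value equals target."""
    matches = []
    for line in dump_text.splitlines():
        m = KNOB_LINE.match(line)
        if m and int(m.group(2)) == target:
            matches.append(m.group(1))
    return matches


# The two lines from the grep output above, used here as sample input.
SAMPLE = """\
SEC_TCP_SESSION_DEADLINE = 300
SEC_TCP_SESSION_TIMEOUT = 20
"""

print(knobs_with_value(SAMPLE, 300))
```

In practice the input would be the complete output of `condor_ce_config_val -dump` saved to a file, with `target` set to 120. Values written as expressions or with units would not be matched by this simple pattern.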

This issue happened in the past when they were submitting batches of 300 jobs, and after they reduced the batch size to 100 everything was fine. I asked this list about it at the time, and the suggestion was that the issue was related to the other side, the LHCb machine that is submitting the jobs, but I'm not sure, since the issue seems to be affecting only us.

Any ideas? We have also looked at our TCP tuning settings, but nothing seems to be clearly related to this 2-minute mystery.

We have 2 HTCondor-CEs and we see the same behaviour on both. All the other experiments (ATLAS, CMS, DUNE, LIGO, etc.) are submitting and starting jobs without any issue.

Thank you in advance.

Cheers,

Carles

Carles Acosta i Silva

PIC (Port d'Informació Científica)

Campus UAB, Edifici D

E-08193 Bellaterra, Barcelona

Tel: +34 93 581 33 08

Fax: +34 93 581 41 10

http://www.pic.es

Avís - Aviso - Legal Notice: http://legal.ifae.es
