
Re: [HTCondor-users] HTCondor-CE: after 2 minutes Staging of job files failed



Hi Carles,

What happened when you increased the value of SEC_TCP_SESSION_DEADLINE
from 120 to 300? That seems like the obvious approach: I can't see any
other timeouts, so it would seem to be on the socket. Did that have any
effect on the staging failures?
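
One way to check this is to measure the gap between the first "Submitting
new job" line and the first "failed to transfer files" line in the schedd
log, before and after the change. Below is a minimal Python sketch,
assuming the log line format shown in your excerpt. The function names
are hypothetical, and the sample lines are copied from your message.

```python
from datetime import datetime

# Sample lines copied from the schedd log excerpt in the original message.
SAMPLE_LOG = """\
06/19/21 07:10:24 (cid:206664) Submitting new job 12437062.0
06/19/21 07:10:24 (cid:206664) Submitting new job 12437062.1
06/19/21 07:12:24 (cid:206668) generalJobFilesWorkerThread(): failed to transfer files for job 12437062.66
06/19/21 07:12:25 ERROR - Staging of job files failed!
"""

# Assumption: every log line starts with a "MM/DD/YY HH:MM:SS" timestamp.
TIMESTAMP_FORMAT = "%m/%d/%y %H:%M:%S"
TIMESTAMP_LENGTH = 17  # len("06/19/21 07:10:24")


def parse_timestamp(line):
    """Return the leading timestamp of a log line, or None if it has none."""
    try:
        return datetime.strptime(line[:TIMESTAMP_LENGTH], TIMESTAMP_FORMAT)
    except ValueError:
        return None


def seconds_until_failure(log_text):
    """Seconds from the first submission line to the first transfer failure."""
    first_submit = None
    for line in log_text.splitlines():
        stamp = parse_timestamp(line)
        if stamp is None:
            continue  # skip lines without a timestamp, such as "[...]"
        if first_submit is None and "Submitting new job" in line:
            first_submit = stamp
        elif first_submit is not None and "failed to transfer files" in line:
            return (stamp - first_submit).total_seconds()
    return None  # no submission followed by a failure was found


if __name__ == "__main__":
    print(seconds_until_failure(SAMPLE_LOG))
```

If the gap stays at 120 seconds with the deadline set to 300, the new
value is probably not the one in effect for this connection.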

Also, did you set this value only on the CE, or also on the LHCb
machine that is submitting the jobs?

If this is still a problem, you could try setting SCHEDD_DEBUG =
D_FULLDEBUG. There's a lot of additional debug messaging in this code
that might give some hints.

Mark


On Sat, Jun 19, 2021 at 12:38 AM Carles Acosta <cacosta@xxxxxx> wrote:
>
> Dear all,
>
> We are running HTCondor-CE 3.4.3 in our HTCondor 8.8.13 cluster.
>
> One experiment, LHCb, is submitting pilots to the HTCondor-CE in batches of 100 jobs. Since yesterday afternoon, we have been seeing that the 100 jobs are submitted to the CE and go on hold (waiting to spool input data). The system then starts to transfer data to the CE and everything looks fine, but the transfer always fails after 2 minutes:
>
> 06/19/21 07:10:24 (cid:206664) Submitting new job 12437062.0
> 06/19/21 07:10:24 (cid:206664) Submitting new job 12437062.1
> [...]
> [...]
> 06/19/21 07:12:24 (cid:206668) generalJobFilesWorkerThread(): failed to transfer files for job 12437062.66
> 06/19/21 07:12:24 condor_write(): Socket closed when trying to write 29 bytes to <188.185.73.26:46488>, fd is 19
> 06/19/21 07:12:24 Buf::write(): condor_write() failed
> 06/19/21 07:12:24 Scheduler::spoolJobFilesWorkerThread(void *arg, Stream* s) NAP TIME
> 06/19/21 07:12:25 ERROR - Staging of job files failed!
>
> For example, in this case, everything failed when we reached job 66 of the batch, and all 100 jobs were removed after the "Staging of job files failed" error.
>
> We have increased SEC_TCP_SESSION_DEADLINE from 120 seconds to 300 (it was one of the first variables we thought might be related to this 2-minute expiration).
>
> [root@ce14 ~]# condor_ce_config_val -dump | grep -i SEC_TCP
> SEC_TCP_SESSION_DEADLINE = 300
> SEC_TCP_SESSION_TIMEOUT = 20
>
> This issue happened in the past when they were submitting batches of 300 jobs, and after reducing to 100 everything was fine. I asked the users list at the time, and the suggestion was that the issue was related to the other side, the LHCb machine that is submitting the jobs, but I'm not sure, since the issue seems to affect only us.
>
> Any ideas? We have also looked at our TCP tuning settings, but nothing seems to be clearly related to this 2-minute mystery.
>
> We have 2 HTCondor-CEs and we see the same behaviour on both. All the other experiments (ATLAS, CMS, DUNE, LIGO, etc.) are submitting and starting jobs without any issue.
>
> Thank you in advance.
>
> Cheers,
>
> Carles
>
>
> --
> Carles Acosta i Silva
> PIC (Port d'Informació Científica)
> Campus UAB, Edifici D
> E-08193 Bellaterra, Barcelona
> Tel: +34 93 581 33 08
> Fax: +34 93 581 41 10
> http://www.pic.es
> Avís - Aviso - Legal Notice:  http://legal.ifae.es



-- 
Mark Coatsworth
Systems Programmer
Center for High Throughput Computing
Department of Computer Sciences
University of Wisconsin-Madison