We run HTCondor-CEs version 3.2.1 on top of HTCondor 8.8.1. WLCG experiments such as ATLAS, CMS and LHCb submit to them without any notable issues. However, for the past few days we have observed that some ATLAS jobs are held with HoldReason "Spooling input data files". This affects only a fraction of the ATLAS jobs, and it is not clear what distinguishes the held jobs from the successful ones: their ClassAds are similar (same SubmitterId, proxy subject, etc.).
On the ATLAS side, they see:
Error sending files to schedd ifaece04.pic.es
: DCSchedd::spoolJobFiles:7002:File transfer failed for target job 1786552.0: Failed to receive GoAhead message from 184.108.40.206.
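To match such client-side errors against jobs on the CE, the relevant fields can be pulled out of the message. The following is only a minimal sketch: the function name and regular expressions are illustrative assumptions derived from the single message quoted above, not a description of a fixed HTCondor error format.

```python
import re

# Text copied from the client-side error quoted above.
ERROR_TEXT = (
    "Error sending files to schedd ifaece04.pic.es\n"
    ": DCSchedd::spoolJobFiles:7002:File transfer failed for target job "
    "1786552.0: Failed to receive GoAhead message from 184.108.40.206."
)

# Patterns are assumptions based on this one sample message.
SCHEDD_RE = re.compile(r"Error sending files to schedd (\S+)")
JOB_RE = re.compile(r"target job (\d+)\.(\d+)")
PEER_RE = re.compile(r"GoAhead message from (\d+(?:\.\d+){3})")


def parse_spool_error(text):
    """Return schedd name, job id and peer address, or None if not matched."""
    schedd = SCHEDD_RE.search(text)
    job = JOB_RE.search(text)
    peer = PEER_RE.search(text)
    if not (schedd and job and peer):
        return None
    return {
        "schedd": schedd.group(1),
        "cluster": int(job.group(1)),
        "proc": int(job.group(2)),
        "peer": peer.group(1),
    }


if __name__ == "__main__":
    print(parse_spool_error(ERROR_TEXT))
```

The resulting cluster and proc numbers can then be looked up in the CE queue and logs.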
On our side, we only see the jobs held, with HoldReason "Spooling input data files", and the only information recorded in the AuditLog is that the job was submitted. The spool directory for each held job is created correctly on the HTCondor-CE, but it is empty. There are no problems with the spool partition: it has enough free space and users can write to it (bear in mind that all the other jobs are working).
Any ideas? Is there any way to increase the verbosity of the AuditLog? We do not see anything useful in the SchedLog file either.
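For reference, the ClassAd comparison between a held job and a working one can be done mechanically rather than by eye. This is a rough sketch, assuming the two ads have already been exported as Python dictionaries (how they are exported depends on the local setup). The function name `diff_classads` and all sample values except the hold reason are hypothetical.

```python
def diff_classads(ad_a, ad_b, ignore=()):
    """Return {attribute: (value_in_a, value_in_b)} for attributes that differ.

    Attributes missing from one ad are reported with None on that side.
    """
    keys = (set(ad_a) | set(ad_b)) - set(ignore)
    return {
        key: (ad_a.get(key), ad_b.get(key))
        for key in sorted(keys)
        if ad_a.get(key) != ad_b.get(key)
    }


# Hypothetical sample ads; only the HoldReason text comes from the report.
HELD_AD = {
    "ClusterId": 1786552,
    "Owner": "atlas001",          # invented account name
    "JobStatus": 5,               # 5 = Held
    "HoldReason": "Spooling input data files",
}
OK_AD = {
    "ClusterId": 1786553,         # invented job id
    "Owner": "atlas001",
    "JobStatus": 2,               # 2 = Running
}

if __name__ == "__main__":
    # ClusterId always differs between jobs, so it is excluded as noise.
    for name, (held, ok) in diff_classads(HELD_AD, OK_AD, ignore=["ClusterId"]).items():
        print(f"{name}: held={held!r} ok={ok!r}")
```

Attributes that are expected to differ for every job (IDs, timestamps) can be added to `ignore` so that only unexpected differences remain.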
Thank you in advance.
Carles Acosta i Silva
PIC (Port d'Informació Científica)
Campus UAB, Edifici D
E-08193 Bellaterra, Barcelona
Tel: +34 93 581 33 08
Fax: +34 93 581 41 10