All jobs that spool input data will spend time in the HOLD state with the hold reason "Spooling input data files".
That hold should be released by the Schedd when spooling is complete.
Are you saying that you have some jobs stuck in that state?
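One way to tell a normal, short-lived spooling hold from a stuck one is to look at how long each job has been in that state. The following is only a minimal sketch, not an official tool: it assumes the job ClassAds have already been loaded as Python dictionaries (for example parsed from `condor_ce_q -json` output), that HoldReasonCode 16 is the code that accompanies "Spooling input data files", and that one hour is a reasonable cutoff. The function name `stuck_spooling`, the threshold, and the sample ads are all hypothetical.

```python
import time

HELD = 5             # JobStatus value for held jobs
SPOOLING_INPUT = 16  # assumed HoldReasonCode for "Spooling input data files"


def stuck_spooling(ads, max_age_s=3600, now=None):
    """Return (ClusterId, ProcId) for jobs held for spooling longer than max_age_s."""
    now = time.time() if now is None else now
    stuck = []
    for ad in ads:
        # Skip jobs that are not held, or held for some other reason.
        if ad.get("JobStatus") != HELD:
            continue
        if ad.get("HoldReasonCode") != SPOOLING_INPUT:
            continue
        # EnteredCurrentStatus is the Unix time at which the job was held.
        entered = ad.get("EnteredCurrentStatus")
        if entered is None:
            continue
        if now - entered > max_age_s:
            stuck.append((ad["ClusterId"], ad["ProcId"]))
    return stuck


# Hypothetical sample ads, for illustration only.
sample_ads = [
    {"ClusterId": 1786552, "ProcId": 0, "JobStatus": 5,
     "HoldReasonCode": 16, "EnteredCurrentStatus": 1000},
    {"ClusterId": 1786553, "ProcId": 0, "JobStatus": 5,
     "HoldReasonCode": 16, "EnteredCurrentStatus": 9500},
    {"ClusterId": 1786554, "ProcId": 0, "JobStatus": 2,
     "EnteredCurrentStatus": 1000},
    {"ClusterId": 1786555, "ProcId": 0, "JobStatus": 5,
     "HoldReasonCode": 1, "EnteredCurrentStatus": 1000},
]

print(stuck_spooling(sample_ads, max_age_s=3600, now=10000))
```

Jobs that show up here long after submission are the ones worth comparing against the jobs that were released normally.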
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx>
On Behalf Of Carles Acosta
We have HTCondor-CEs version 3.2.1 over HTCondor 8.8.1. WLCG experiments such as Atlas, CMS and LHCb are submitting there without any notable issue. However, for the past few days, we have observed that some Atlas jobs are held with HoldReason "Spooling input data files". This happens for a fraction of the Atlas jobs, not for all of them, and it is not clear whether there is any difference between the jobs that run correctly and the ones that are held; the job ClassAds are similar (same SubmitterId, proxy subject, etc.).
From the Atlas side, they see:
Error sending files to schedd ifaece04.pic.es: DCSchedd::spoolJobFiles:7002:File transfer failed for target job 1786552.0: Failed to receive GoAhead message from 188.8.131.52.
From our side, we only see the jobs held, with hold reason "Spooling input data files", and the only information returned in the AuditLog is that the job was submitted. On the other hand, the spool directory in the HTCondor-CE for each held job is correctly created but empty. There are no problems with the spool partition: it has enough space and the users can write to it (bear in mind that all the other jobs are working).
Any ideas? Or is there any way to increase the verbosity of the AuditLog? We do not see anything useful in the SchedLog file either.
Thank you in advance.