Re: [HTCondor-users] Jobs held because they are not spooling data



Hi TJ,

Yes, some jobs stay stuck indefinitely in the hold state with HoldReason "Spooling input data files"; the others are released correctly.

Carles

On Fri, 26 Apr 2019 at 16:35, John M Knoeller <johnkn@xxxxxxxxxxx> wrote:

All jobs that spool input data will spend time in the HOLD state with the hold reason "Spooling input data files".

That hold should be released by the Schedd when spooling is complete.

Are you saying that you have some jobs stuck in that state?

-tj

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Carles Acosta
Sent: Friday, April 26, 2019 3:32 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Aresh Vedaee (PIC) <avedaee@xxxxxx>
Subject: [HTCondor-users] Jobs held because they are not spooling data

Hello all,

We run HTCondor-CE version 3.2.1 on top of HTCondor 8.8.1. WLCG experiments such as ATLAS, CMS and LHCb submit jobs there without any notable issues. However, for the past few days we have observed that some ATLAS jobs are held with HoldReason "Spooling input data files". This affects only a fraction of the ATLAS jobs, and it is not clear what distinguishes the jobs that run correctly from the held ones: their job ClassAds are similar (same SubmitterId, proxy subject, etc.).

On the ATLAS side, they see:

Error sending files to schedd ifaece04.pic.es: DCSchedd::spoolJobFiles:7002:File transfer failed for target job 1786552.0: Failed to receive GoAhead message from 193.109.175.10.

On our side, we only see the jobs held with HoldReason "Spooling input data files", and the only information recorded in the AuditLog is that the job was submitted. The spool directory in the HTCondor-CE for each held job is created correctly but is empty. There are no problems with the spool partition: it has enough free space and the users can write to it (bear in mind that all the other jobs are working).
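
To tell jobs that are genuinely stuck apart from jobs that are only briefly held while spooling, something like the following could help. It is only a rough sketch: it assumes the job ClassAds have already been loaded as a list of Python dicts (for example from the JSON output of the queue query command), and the function name and the one-hour threshold are hypothetical choices, not anything provided by HTCondor.

```python
import time

HELD = 5  # JobStatus value for a held job
SPOOLING_REASON = "Spooling input data files"  # HoldReason text seen on the held jobs


def stuck_spooling_jobs(ads, max_age_s=3600, now=None):
    """Return (job id, seconds held) for jobs held for spooling longer than max_age_s.

    `ads` is a list of dicts, one per job ClassAd (hypothetical input format).
    """
    now = time.time() if now is None else now
    stuck = []
    for ad in ads:
        if ad.get("JobStatus") != HELD:
            continue
        if ad.get("HoldReason") != SPOOLING_REASON:
            continue
        # EnteredCurrentStatus is the epoch time at which the job entered its current state
        age = now - ad.get("EnteredCurrentStatus", now)
        if age > max_age_s:
            job_id = "{}.{}".format(ad.get("ClusterId"), ad.get("ProcId"))
            stuck.append((job_id, int(age)))
    return stuck
```

Jobs reported by this function have stayed in the spooling hold well past the chosen threshold, which should make it easier to compare their ClassAds against those of jobs that were released normally.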

Any ideas? Is there a way to increase the verbosity of the AuditLog? We do not see anything useful in the SchedLog file either.

Thank you in advance.

Best regards,

Carles

--

Carles Acosta i Silva

PIC (Port d'Informació Científica)

Campus UAB, Edifici D

E-08193 Bellaterra, Barcelona

Tel: +34 93 581 33 08

Fax: +34 93 581 41 10

Avís - Aviso - Legal Notice: http://www.ifae.es/legal.html

--
Carles Acosta i Silva
PIC (Port d'Informació Científica)
Campus UAB, Edifici D
E-08193 Bellaterra, Barcelona
Tel: +34 93 581 33 08
Fax: +34 93 581 41 10
http://www.pic.es
Avís - Aviso - Legal Notice: http://www.ifae.es/legal.html