
Re: [HTCondor-users] Jobs held because they are not spooling data



Hello all,

We finally solved the issue. We discussed it with Brian and Jaime in a GGUS ticket, but I would like to explain here how we solved it, for all the users of the mailing list. Sorry in advance, the explanation is a bit long.

The problem is related to an old issue we saw some time ago, and it is a consequence of our configuration, which uses an Argus server for authentication/authorization. When GSS_ASSIST_GRIDMAP_CACHE_EXPIRATION was not set to 0, users with the same DN but a different VO extension (Role) were mapped to the same local user. The authorization and mapping returned by the Argus server itself were always correct, so setting GSS_ASSIST_GRIDMAP_CACHE_EXPIRATION = 0 fixed the mapping.
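
For reference, this is the kind of snippet we mean (a minimal sketch; the config.d file name below is our own choice, not a standard one):

    # /etc/condor-ce/config.d/99-gridmap-cache.conf
    # 0 disables the gridmap result cache, so every new connection is
    # authorized and mapped through Argus again
    GSS_ASSIST_GRIDMAP_CACHE_EXPIRATION = 0

followed by a condor_ce_reconfig to pick it up.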

As we mentioned in a previous email to this list (https://lists.cs.wisc.edu/archive/htcondor-users/2017-October/msg00008.shtml), disabling the gridmap cache leads to high load and memory problems in the condor_schedd of the HTCondor-CE, so it was not our preferred solution. In any case, most of our users always submit with the same DN+Role, so the incorrect mapping occurred only rarely.

We then realized that the held Atlas jobs were being incorrectly mapped (for instance, Role=production mapped to our local Atlas analysis user) because Atlas had begun to submit jobs using the same DN with different Roles. Disabling the gridmap cache solved our problem with the held jobs.

Digging into why the cache was misbehaving, we realized that, although Argus performs the authentication/authorization, the HTCondor-CE still needs /etc/grid-security/vomsdir and /etc/vomses populated with the VOMS information. Without them, the mapping lines in the SchedLog showed included_voms: 0:

ZKM: 2: mapret: 0 included_voms: 0 canonical_user: GSS_ASSIST_GRIDMAP

After adding the VOMS configuration, these lines changed to included_voms: 1, and the mapping is now always correct even with GSS_ASSIST_GRIDMAP_CACHE_EXPIRATION back at our default of 30 minutes.
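
For anyone hitting the same problem, the files look roughly like this (the VOMS host, DNs and port below are placeholders, not our real values; the file formats themselves are the standard ones, and /etc/vomses can be either a single file or a directory of per-VO files):

    # /etc/grid-security/vomsdir/atlas/voms.example.org.lsc
    # One .lsc file per VOMS server: the server subject DN on the
    # first line, its issuer (CA) DN on the second
    /DC=org/DC=example/CN=voms.example.org
    /DC=org/DC=example/CN=Example CA

    # /etc/vomses/atlas  ("alias" "host" "port" "server DN" "VO name")
    "atlas" "voms.example.org" "15001" "/DC=org/DC=example/CN=voms.example.org" "atlas"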

So, in summary: when using an Argus server, the HTCondor-CEs still need the vomsdir/vomses configuration in place before the GSS_ASSIST_GRIDMAP_CACHE_EXPIRATION cache can be used safely.
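
As a quick check on a CE (condor_ce_config_val ships with HTCondor-CE, and the SchedLog path below is the standard HTCondor-CE location; adjust if yours differs):

    # Effective value of the cache knob on the CE
    condor_ce_config_val GSS_ASSIST_GRIDMAP_CACHE_EXPIRATION

    # Confirm the VOMS attributes are being picked up in the mappings
    grep included_voms /var/log/condor-ce/SchedLog | tail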

Thank you very much.

Cheers,

Carles

On Fri, 26 Apr 2019 at 16:53, Carles Acosta <cacosta@xxxxxx> wrote:
Hi TJ,

Yes, some jobs are stuck forever in the hold state with HoldReason "Spooling input data files"; the other ones are correctly released.
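
In case it helps, this is roughly how we list them (a sketch: JobStatus 5 is Held, HoldReasonCode 16 is the documented code for input files being spooled, and condor_ce_q is the CE wrapper around condor_q):

    # List jobs stuck held while spooling input, and for how long
    condor_ce_q -constraint 'JobStatus == 5 && HoldReasonCode == 16' \
                -af:j Owner 'time() - EnteredCurrentStatus'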

Carles

On Fri, 26 Apr 2019 at 16:35, John M Knoeller <johnkn@xxxxxxxxxxx> wrote:

All jobs that spool input data will spend time in the HOLD state with the hold reason "Spooling input data files".

That hold should be released by the Schedd when spooling is complete.


Are you saying that you have some jobs stuck in that state?


-tj


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Carles Acosta
Sent: Friday, April 26, 2019 3:32 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Aresh Vedaee (PIC) <avedaee@xxxxxx>
Subject: [HTCondor-users] Jobs held because they are not spooling data


Hello all,


We have HTCondor-CEs version 3.2.1 running over HTCondor 8.8.1. WLCG experiments such as Atlas, CMS and LHCb submit there without any remarkable issue. However, since a few days ago, we have observed that some Atlas jobs are held with HoldReason "Spooling input data files". This happens for a fraction of the Atlas jobs, not all of them, and it is not clear what differs between the jobs that run correctly and the ones that are held; their job ClassAds are similar (same SubmitterId, proxy subject, etc.).


From the Atlas side, they see:


Error sending files to schedd ifaece04.pic.es: DCSchedd::spoolJobFiles:7002:File transfer failed for target job 1786552.0: Failed to receive GoAhead message from 193.109.175.10.


From our side, we only see the jobs Held with HoldReason "Spooling input data files", and the only information in the AuditLog is that the job was submitted. The spool directory in the HTCondor-CE for the held jobs is correctly created, but empty. There are no problems with the spool partition: it has enough free space and the users can write to it (note that all the other jobs are working).


Any ideas? Or is there any way to increase the verbosity of the AuditLog? We do not see anything useful in the SchedLog file...
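
(In the meantime we are raising the schedd debug level on the CE to get more detail; a sketch of what we mean, where D_SECURITY is the debug category covering the authentication and mapping lines, and the file name is our own choice:

    # /etc/condor-ce/config.d/99-debug.conf
    # Verbose schedd logging, including security/mapping lines
    SCHEDD_DEBUG = D_FULLDEBUG D_SECURITY

followed by a condor_ce_reconfig.)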


Thank you in advance.


Best regards,


Carles

--

Carles Acosta i Silva

PIC (Port d'Informació Científica)

Campus UAB, Edifici D

E-08193 Bellaterra, Barcelona

Tel: +34 93 581 33 08

Fax: +34 93 581 41 10

Avís - Aviso - Legal Notice: http://www.ifae.es/legal.html

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/


--
Carles Acosta i Silva
PIC (Port d'Informació Científica)
Campus UAB, Edifici D
E-08193 Bellaterra, Barcelona
Tel: +34 93 581 33 08
Fax: +34 93 581 41 10
Avís - Aviso - Legal Notice: http://www.ifae.es/legal.html


--
Carles Acosta i Silva
PIC (Port d'Informació Científica)
Campus UAB, Edifici D
E-08193 Bellaterra, Barcelona
Tel: +34 93 581 33 08
Fax: +34 93 581 41 10
http://www.pic.es
Avís - Aviso - Legal Notice: http://www.ifae.es/legal.html