
Re: [HTCondor-users] Intermittent submission failures to HTCondor-CE



Hi,

Adding the GSI_AUTHENTICATION_TIMEOUT config knob did not solve the issue. We still see these messages from time to time, roughly every hour (not exactly!):

DC_AUTHENTICATE: reason for authentication failure: AUTHENTICATE:1006:exceeded XXXXX deadline during authentication|AUTHENTICATE:1004:Failed to authenticate using IDTOKENS|AUTHENTICATE:1004:Failed to authenticate using FS|FS:1004:Unable to lstat(/tmp/FS_XXXsvyHJB)
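For anyone scanning their logs for the same pattern, here is a minimal sketch that splits such a failure reason into its individual entries. It assumes the layout seen in the message above (entries separated by "|", each of the form SUBSYSTEM:CODE:message); that layout is inferred from this log line only, and parse_auth_failure is a hypothetical helper name, not part of HTCondor.

```python
def parse_auth_failure(reason):
    """Split a failure reason into (subsystem, code, message) tuples.

    Assumes entries are separated by '|' and each entry looks like
    SUBSYSTEM:CODE:message, as in the log line quoted in this thread.
    """
    entries = []
    for part in reason.split("|"):
        # Split on the first two colons only, so colons in the message survive.
        subsystem, code, message = part.split(":", 2)
        entries.append((subsystem, int(code), message))
    return entries


# The reason string exactly as posted (deadline value redacted as XXXXX).
reason = (
    "AUTHENTICATE:1006:exceeded XXXXX deadline during authentication"
    "|AUTHENTICATE:1004:Failed to authenticate using IDTOKENS"
    "|AUTHENTICATE:1004:Failed to authenticate using FS"
    "|FS:1004:Unable to lstat(/tmp/FS_XXXsvyHJB)"
)

for subsystem, code, message in parse_auth_failure(reason):
    print(subsystem, code, message)
```

The first entry (code 1006) is the exceeded deadline; the remaining entries are the per-method failures. A message that itself contains "|" would break this simple split.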

Most of the time, however, the jobs submitted from these machines are correctly accepted and mapped in our CE. Furthermore, we have another CE with the same configuration and HTCondor versions that does not show these issues: all jobs are always correctly accepted. Any other ideas? What could cause these intermittent authentication errors? Since this does not seem to be an issue with the HTCondor-CE or HTCondor version or configuration, do you know which system packages, or perhaps network or TCP sysctl parameters, could be related to this deadline being exceeded during authentication?

Thank you very much again.

Cheers,

Carles

On Sat, 17 Sept 2022 at 15:06, Carles Acosta <cacosta@xxxxxx> wrote:
Hi Cole,

Thank you very much. The timestamps in our CE log are in CEST (UTC+2), so the difference is 5 seconds in total. GSI_AUTHENTICATION_TIMEOUT is not set in our CEs; I'm going to check it.

Cheers,

Carles

On Fri, 16 Sept 2022 at 21:06, Cole Bollig via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
Hi Carles,

I am not sure if I am messing something up with the epoch conversion, but I get 1663300416 becoming 9/16/22 3:53:36, which would imply a 2 hr. and 5 sec. timeout. That doesn't seem right to me, so you may want to confirm that. Otherwise, I did find a config knob called GSI_AUTHENTICATION_TIMEOUT. I would first check whether it is set with condor_config_val and then maybe give it a try.

Best,
Cole Bollig

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Carles Acosta <cacosta@xxxxxx>
Sent: Friday, September 16, 2022 1:00 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Intermittent submission failures to HTCondor-CE
Hello again,

We have finally found the error message in our CE:

09/16/22 05:53:41 (cid:630527) Command=QMGMT_WRITE_CMD, peer=<XXXXX:YYY>
09/16/22 05:53:41 (cid:630527) Authentication Failed, MethodsTried=FS,TOKEN,SCITOKENS,GSI,SSL
09/16/22 05:53:41 DC_AUTHENTICATE: authentication of <XXXXX:YYY> did not result in a valid mapped user name, which is required for this command (1112 QMGMT_WRITE_CMD), so aborting.
09/16/22 05:53:41 DC_AUTHENTICATE: reason for authentication failure: AUTHENTICATE:1006:exceeded 1663300416 deadline during authentication|AUTHENTICATE:1004:Failed to authenticate using IDTOKENS|AUTHENTICATE:1004:Failed to authenticate using FS|FS:1004:Unable to lstat(/tmp/FS_XXXsvyHJB)

In "exceeded 1663300416 deadline during authentication", 1663300416 is 5 seconds before 09/16/22 05:53:41. So I understand that the authentication took more than 5 seconds and then failed, right? This does not happen on our other CE (same version and configs), and on this one it happens only from time to time. Is there any way to increase the 5-second deadline? We use GSI authentication against an Argus; could this be related? Or could it be related to GSS_ASSIST_GRIDMAP_CACHE_EXPIRATION?
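The 5-second figure can be checked with a short sketch like the one below. It assumes the deadline in the message is a Unix epoch value and that the CE log timestamps are local time at a fixed UTC+2 offset (CEST), as stated elsewhere in this thread; seconds_past_deadline is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC+2 offset; assumes the CE log is written in CEST.
CEST = timezone(timedelta(hours=2), "CEST")


def seconds_past_deadline(deadline_epoch, log_stamp, tz=CEST):
    """Return the deadline as a datetime and how many seconds after it
    the log line was written (log_stamp uses the MM/DD/YY HH:MM:SS format)."""
    deadline = datetime.fromtimestamp(deadline_epoch, tz=tz)
    logged = datetime.strptime(log_stamp, "%m/%d/%y %H:%M:%S").replace(tzinfo=tz)
    return deadline, (logged - deadline).total_seconds()


deadline, delta = seconds_past_deadline(1663300416, "09/16/22 05:53:41")
print(deadline.isoformat())  # 2022-09-16T05:53:36+02:00
print(delta)                 # 5.0
```

Read as UTC, the same epoch value is 03:53:36, which is where an apparent 2-hour gap comes from if the log timestamps are taken to be UTC.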

Thank you again.

Best regards,

Carles


On Mon, 12 Sept 2022 at 16:38, Carles Acosta <cacosta@xxxxxx> wrote:
Dear all,

We have a strange issue regarding our HTCondor-CEs.

The LHCb experiment is experiencing intermittent submission errors to our CE:

Pilot submission failed with error: ERROR: Failed to connect to queue manager ce14.pic.es
AUTHENTICATE:1005:Failed to securely exchange session key
AUTHENTICATE:1004:Failed to authenticate using IDTOKENS
AUTHENTICATE:1004:Failed to authenticate using FS

At other times, however, everything works fine and the submissions succeed. Furthermore, there is another CE where everything is always OK, and it shares the same version and general configuration as ce14.pic.es. Both CEs are running condor 9.0.16 and HTCondor-CE version 5.1.5.

We do not see any error in the CE logs that explains this behavior. The experiment is authenticated correctly through GSI whenever a submission to ce14 succeeds, and always on the other CE. Any ideas? I really do not know how to debug this issue, since I do not see any error in the CE log.

Thank you in advance.

Best regards,

Carles
--
Carles Acosta i Silva
PIC (Port d'Informació Científica)
Campus UAB, Edifici D
E-08193 Bellaterra, Barcelona
Tel: +34 93 581 33 08
Fax: +34 93 581 41 10
Avís - Aviso - Legal Notice: http://legal.ifae.es


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/




--
Carles Acosta i Silva
PIC (Port d'Informació Científica)
Campus UAB, Edifici D
E-08193 Bellaterra, Barcelona
Tel: +34 93 581 33 08
Fax: +34 93 581 41 10
http://www.pic.es
Avís - Aviso - Legal Notice: http://legal.ifae.es