[HTCondor-users] How to debug [held job due to expired user proxy]




Hi group,

First of all, my setup is:

HTCondor-CE + HTCondor batch system

$CondorVersion: 8.8.5 Sep 04 2019 BuildID: 480168 PackageID: 8.8.5-1 $
$CondorPlatform: x86_64_RedHat7 $

25 workernodes with 40 cores each (1000 cores in total)

--------

A lot of jobs are going into the held state with this message: "held job due to expired user proxy".

Sometimes I have more than 3k held jobs, and this makes my CE unavailable until I remove them from the queue. While it is unavailable, condor_ce_q fails like this:

condor_ce_q

-- Failed to fetch ads from: <10.10.0.10:17685> : <myce-fqdn>
SECMAN:2007:Failed to end classad message.

----------

Attached is a graph of my CE, built from the output of condor_ce_q and condor_status --total.

As you can see, it fails from time to time.

Where should I start debugging?


Regards,

EJ

Attachment: Screen Shot 2020-08-05 at 21.25.33.png
Description: PNG image