
[HTCondor-users] Inconsistent output of "condor_q -glo"?



Good morning,

after a major reconfiguration of our Hypatia cluster (some jobs had already
been held before it), I'm now getting inconsistent output from condor_q:

root@condormaster:.# condor_status -schedd
Name                                       Machine             RunningJobs   IdleJobs   HeldJobs

hypatia1.hypatia.local@xxxxxxxxxxxxxxxxxx hypatia1.my.domain           0          0          0
hypatia2.hypatia.local@xxxxxxxxxxxxxxxxxx hypatia2.my.domain           0          0        183
hypatia3.hypatia.local@xxxxxxxxxxxxxxxxxx hypatia3.my.domain           0          0          0

                TotalRunningJobs      TotalIdleJobs      TotalHeldJobs

              
         Total                 0                  0                183
root@condormaster:.# condor_q -schedd hypatia1.my.domain
All queues are empty
root@condormaster:.# condor_q -schedd hypatia2.my.domain
All queues are empty
root@condormaster:.# condor_q -schedd hypatia3.my.domain
All queues are empty

(The result is the same if I use the "hypatia*.hypatia.local" names instead.)

root@condormaster:.# condor_q -glo

-- Failed to fetch ads from: <10.150.100.102:4597?addrs=10.150.100.102-4597&alias=hypatia2.my.domain> : hypatia2.my.domain
AUTHENTICATE:1003:Failed to authenticate with any method
AUTHENTICATE:1004:Failed to authenticate using FS
root@condormaster:.# 

I have compared the output of "condor_config_val -dump" on hypatia1 and hypatia2
and see no differences, apart from a few machine- and IP-specific lines.
What's behind those AUTHENTICATE:100{3,4} failures?
In the ScheddLog on hypatia2, I see:

DC_AUTHENTICATE: reason for authentication failure: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using FS|FS:1004:Unable to lstat(/tmp/FS_XXXvkEMCP)

Since /tmp has permissions 1777, what could cause the lstat() error?
And why does this happen on only one of the three submit nodes?
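
One thing that could be checked here (a guess to verify, not a confirmed
cause): whether the condor_schedd process sees the same /tmp as the shell
running condor_q. That would not be the case if the daemon runs in its own
mount namespace, e.g. with a private /tmp. A minimal sketch, Linux only and
run as root on the submit node; the helper name same_mount_ns is hypothetical
and not part of HTCondor:

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if process PID shares the mount
# namespace of the current shell, i.e. both should see the same /tmp.
# Exit 1 if the namespaces differ, 2 if /proc could not be read
# (reading another process's namespace link usually requires root).
same_mount_ns() {
    pid_ns=$(readlink "/proc/$1/ns/mnt") || return 2
    self_ns=$(readlink "/proc/self/ns/mnt") || return 2
    [ "$pid_ns" = "$self_ns" ]
}

# Compare against the oldest running condor_schedd process, if there is one.
schedd_pid=$(pgrep -o -x condor_schedd 2>/dev/null || true)
if [ -n "$schedd_pid" ]; then
    if same_mount_ns "$schedd_pid"; then
        echo "condor_schedd shares this shell's mount namespace"
    else
        echo "condor_schedd is in a different mount namespace (or /proc is unreadable)"
    fi
fi
```

If the namespaces differ, a file or directory created under /tmp by condor_q
would not be visible to the daemon, which would be consistent with the
lstat() failure above; identical namespaces would rule this explanation out.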

# condor_version
$CondorVersion: 9.0.7 Nov 03 2021 BuildID: Debian-9.0.7-1+deb10u0 PackageID: 9.0.7-1+deb10u0 Debian-9.0.7-1+deb10u0 $
$CondorPlatform: X86_64-Debian_10 $


Thanks,
 Steffen

-- 
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Phone: +49-331-567 7274
Mail: steffen.grunewald(at)my.domain
~~~