
Re: [HTCondor-users] Inconsistent output of "condor_q -glo"?



On 11/18/2021 2:50 AM, Steffen Grunewald wrote:
Good morning,
Hi Steffen,

Some ideas inline below...

after a major reconfig of our Hypatia cluster, with a couple of jobs having
been held before, I'm now getting somewhat inconsistent output from condor_q:

root@condormaster:.# condor_status -schedd
Name                                       Machine             RunningJobs   IdleJobs   HeldJobs

hypatia1.hypatia.local@xxxxxxxxxxxxxxxxxx hypatia1.my.domain           0          0          0
hypatia2.hypatia.local@xxxxxxxxxxxxxxxxxx hypatia2.my.domain           0          0        183
hypatia3.hypatia.local@xxxxxxxxxxxxxxxxxx hypatia3.my.domain           0          0          0

                TotalRunningJobs      TotalIdleJobs      TotalHeldJobs

              
         Total                 0                  0                183
root@condormaster:.# condor_q -schedd hypatia1.my.domain
All queues are empty
root@condormaster:.# condor_q -schedd hypatia2.my.domain
All queues are empty
root@condormaster:.# condor_q -schedd hypatia3.my.domain
All queues are empty

For the above commands,  does the following work:

    condor_q -allusers -name hypatia2.my.domain

?

Note the use of "-name" instead of "-schedd"; I think you wanted "-name" here.

Also by default, the schedd will only show the jobs owned by the user making the query.  Adding "-allusers" will give information for all users, regardless of who issued the condor_q command.


(same if I use "hypatia*.hypatia.local")

root@condormaster:.# condor_q -glo

-- Failed to fetch ads from: <10.150.100.102:4597?addrs=10.150.100.102-4597&alias=hypatia2.my.domain> : hypatia2.my.domain
AUTHENTICATE:1003:Failed to authenticate with any method
AUTHENTICATE:1004:Failed to authenticate using FS
root@condormaster:.# 

I have compared the output of "condor_config_val -dump" for hypatia1 and hypatia2,
and see no difference (except the few machine-/IP-specific lines).
What's behind those AUTHENTICATE:100{3,4} failures?


So instead of the above, does the following command work:

    condor_q -global -allusers

?

If adding "-allusers" works, here is the explanation: by default, the schedd only shows jobs owned by the user who issued condor_q.  To do that, the schedd needs to learn who issued the condor_q command via authentication, and the error is likely the result of no authentication method being configured that works over the network.  With "-allusers", the schedd does not need to know who issued the command, and can simply return all the jobs (assuming the host has READ authorization).
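
To make that reasoning concrete, here is a toy model in Python of the decision described above. It is only a sketch and not HTCondor code: the function name visible_jobs, its parameters, and the dictionary layout of a job are all made up for illustration.

```python
from typing import Dict, List, Optional

Job = Dict[str, str]


def visible_jobs(jobs: List[Job], all_users: bool,
                 authenticated_user: Optional[str]) -> List[Job]:
    """Hypothetical model of which jobs a query gets back."""
    if all_users:
        # No per-owner filtering, so the caller's identity is not needed.
        return list(jobs)
    if authenticated_user is None:
        # Filtering by owner is impossible when authentication failed.
        raise PermissionError("cannot filter by owner: caller identity unknown")
    return [job for job in jobs if job["Owner"] == authenticated_user]
```

In this model, a query with all_users=True succeeds even when authenticated_user is None, which mirrors why "-allusers" can sidestep the authentication failure.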

If you want "-allusers" to be the default whenever a condor_q command is issued, you can add the following to the condor_config:

     CONDOR_Q_ONLY_MY_JOBS = False

Another way to handle this would be to allow READ access using CLAIMTOBE authentication.   CLAIMTOBE is not secure (the client can claim to be anybody), but the idea here is to only allow it for READ operations.  This would allow users to issue a condor_q command from a remote machine and still see only their jobs. 

In the ScheddLog, I see

DC_AUTHENTICATE: reason for authentication failure: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using FS|FS:1004:Unable to lstat(/tmp/FS_XXXvkEMCP)

Since /tmp has permissions 1777, what causes the lstat() error?


You are issuing "condor_q" on machine A, and it is trying to talk to a schedd on machine B.  The schedd is trying to authenticate the person who issued the "condor_q" command, as explained above (unless you use -allusers).  The "FS" authentication method (for FileSystem authentication) works as follows: the schedd asks the client to create a file in /tmp, and then the schedd does an lstat() on that file to read the file ownership and thus authenticate the identity of the person issuing condor_q.  This lstat() failed because /tmp is not shared between machine A and machine B: the file that condor_q created exists only in machine A's /tmp, so the schedd on machine B cannot find it.
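
The handshake above can be imitated with a small Python sketch. This is a simplified model under stated assumptions, not the actual HTCondor implementation: the function fs_authenticate is hypothetical, the two directory arguments stand in for /tmp on machine A and machine B, and the random file name only imitates the FS_XXX... pattern seen in the log.

```python
import os
import tempfile
from typing import Optional


def fs_authenticate(client_tmp: str, schedd_tmp: str) -> Optional[int]:
    """Toy FS handshake: return the authenticated UID, or None on failure."""
    # The schedd picks a fresh name and asks the client to create it.
    name = "FS_" + os.urandom(4).hex()

    # Client side: create the file in the client's own /tmp.
    client_path = os.path.join(client_tmp, name)
    with open(client_path, "x"):
        pass

    try:
        # Schedd side: lstat() the same name in the schedd's /tmp.
        st = os.lstat(os.path.join(schedd_tmp, name))
    except FileNotFoundError:
        # The file exists only on the client's machine, so FS fails.
        return None
    finally:
        os.remove(client_path)

    # The file owner is taken as the identity of the caller.
    return st.st_uid


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_a, \
            tempfile.TemporaryDirectory() as tmp_b:
        print(fs_authenticate(tmp_a, tmp_a))  # same /tmp: the caller's UID
        print(fs_authenticate(tmp_a, tmp_b))  # separate /tmp: None
```

Passing the same directory twice models running condor_q on the schedd's own machine, where FS succeeds; passing two different directories models the remote case that produced the lstat() error.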

Hope the above helps,
regards
Todd