
Re: [HTCondor-users] Inconsistent output of "condor_q -glo"?



On Fri, 2021-11-19 at 17:27:49 -0600, Todd Tannenbaum wrote:
> On 11/18/2021 2:50 AM, Steffen Grunewald wrote:
> > Good morning,
> Hi Steffen,
> 
> Some ideas inline below...
> 
> > after a major reconfig of our Hypatia cluster, with a couple of jobs having
> > been held before, I'm now getting somewhat inconsistent output from condor_q:
> > 
> > root@condormaster:.#  condor_status -schedd
> > Name                                       Machine             RunningJobs   IdleJobs   HeldJobs
> > 
> > hypatia1.hypatia.local@xxxxxxxxxxxxxxxxxx  hypatia1.my.domain           0          0          0
> > hypatia2.hypatia.local@xxxxxxxxxxxxxxxxxx  hypatia2.my.domain           0          0        183
> > hypatia3.hypatia.local@xxxxxxxxxxxxxxxxxx  hypatia3.my.domain           0          0          0
> > 
> >                  TotalRunningJobs      TotalIdleJobs      TotalHeldJobs
> > 
> >           Total                 0                  0                183
> > root@condormaster:.#  condor_q -schedd hypatia1.my.domain
> > All queues are empty
> > root@condormaster:.#  condor_q -schedd hypatia2.my.domain
> > All queues are empty
> > root@condormaster:.#  condor_q -schedd hypatia3.my.domain
> > All queues are empty
> 
> For the above commands,  does the following work:
> 
>     condor_q -allusers -name hypatia2.my.domain
> 
> ?

This only works with the "long" name "hypatia1.hypatia.local@xxxxxxxxxxxxxxxxxx".

> 
> Note the use of "-name" instead of "-schedd" .... I think you wanted -name here.
> 
> Also by default, the schedd will only show the jobs owned by the user making
> the query.  Adding "-allusers" will give information for all users,
> regardless of who issued the condor_q command.
> 
> 
> > (same if I use "hypatia*.hypatia.local")
> > 
> > root@condormaster:.#  condor_q -glo
> > 
> > -- Failed to fetch ads from: <10.150.100.102:4597?addrs=10.150.100.102-4597&alias=hypatia2.my.domain> : hypatia2.my.domain
> > AUTHENTICATE:1003:Failed to authenticate with any method
> > AUTHENTICATE:1004:Failed to authenticate using FS
> > root@condormaster:.#
> > 
> > I have compared the output of "condor_config_val -dump" for hypatia1 and hypatia2,
> > and see no difference (except the few machine-/IP-specific lines).
> > What's behind those AUTHENTICATE:100{3,4} failures?
> 
> 
> So instead of the above, does the following command work:
> 
>     condor_q -global -allusers
> 
> ?

Indeed, this works, without a complaint.

> If adding "-allusers" works, here is an explanation: the schedd will
> only show jobs owned by the user who issued condor_q.  To do this, the
> schedd needs to know who issued the condor_q command via authentication, and
> the error is likely a result of no authentication method that works over the
> network being configured. By adding "-allusers", the schedd does not need to
> know who issued the command, and can just return all the jobs (assuming the
> host has READ authorization).
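
To make the explanation above concrete, here is a toy model in Python of the decision it describes. This is only a sketch under stated assumptions, not HTCondor code: the function name answer_query, its parameters, and the dictionary job records are hypothetical and exist only for illustration.

```python
def answer_query(jobs, allusers, authenticated_user=None):
    """Toy model of the schedd's choice described above (hypothetical names)."""
    if allusers:
        # No need to know who is asking: return every job.
        return list(jobs)
    if authenticated_user is None:
        # Filtering by owner requires an authenticated identity.
        raise PermissionError("no authentication method succeeded")
    # Default behaviour: only the jobs owned by the requesting user.
    return [job for job in jobs if job["Owner"] == authenticated_user]


if __name__ == "__main__":
    queue = [{"Owner": "alice", "ClusterId": 1}, {"Owner": "bob", "ClusterId": 2}]
    print(answer_query(queue, allusers=True))
    print(answer_query(queue, allusers=False, authenticated_user="bob"))
```

In this model, a query without "-allusers" and without a working authentication method is the only path that fails, which matches the behaviour seen with "condor_q -glo" versus "condor_q -global -allusers".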

I'll look into the authorization setup details. I took a rather naive path
when upgrading from 8.8 to 9.0, and the devil seemingly is in some detail.

Is it correct that this error was only reported *because* there were jobs
held on this schedd, not on the others?

> If you want "-allusers" to be the default whenever a condor_q command is
> issued, you can add the following to the condor_config:
> 
>      CONDOR_Q_ONLY_MY_JOBS = False

OK; perhaps I was wrong when I thought that root would be handled differently?

> Another way to handle this would be to allow READ access using CLAIMTOBE
> authentication.   CLAIMTOBE is not secure (the client can claim to be
> anybody), but the idea here is to only allow it for READ operations.  This
> would allow users to issue a condor_q command from a remote machine and
> still see only their jobs.
> 
> > In the ScheddLog, I see
> > 
> > DC_AUTHENTICATE: reason for authentication failure: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using FS|FS:1004:Unable to lstat(/tmp/FS_XXXvkEMCP)
> > 
> > Since /tmp has permissions 1777, what causes the lstat() error?
> > 
> 
> You are issuing "condor_q" on machine A,  and it is trying to talk to a
> schedd on machine B.  The schedd is trying to authenticate the person who
> issued the "condor_q" command as explained above (unless you use
> -allusers).  The "FS" authentication method (for FileSystem authentication)
> works as follows:  the schedd asks the client to create a file in /tmp, and
> then the schedd does an lstat() on the file to read the file ownership and
> thus authenticate the identity of the person issuing condor_q.  This lstat()
> failed because /tmp is not shared between machine A and machine B, and thus
> the schedd is unable to lstat() the file it asked condor_q to create because
> the file is not there.
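
The FS handshake described above can be imitated with a few lines of Python. This is a simplified simulation under stated assumptions, not the actual HTCondor implementation: the function names are hypothetical, and two temporary directories stand in for the /tmp of machine A and machine B.

```python
import os
import tempfile
import uuid


def server_pick_challenge(server_tmp):
    # The "schedd" picks a file name it expects to find in its own /tmp.
    return os.path.join(server_tmp, "FS_" + uuid.uuid4().hex)


def client_create(challenge_path, client_tmp):
    # The "client" creates that file name in *its* view of /tmp.
    local_path = os.path.join(client_tmp, os.path.basename(challenge_path))
    with open(local_path, "x"):
        pass
    return local_path


def server_authenticate(challenge_path):
    # The "schedd" lstat()s the file and reads the owner from it.
    try:
        return os.lstat(challenge_path).st_uid
    except FileNotFoundError:
        return None  # corresponds to "Unable to lstat(...)"


def fs_handshake(server_tmp, client_tmp):
    challenge = server_pick_challenge(server_tmp)
    client_create(challenge, client_tmp)
    return server_authenticate(challenge)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as shared:
        print("same /tmp:", fs_handshake(shared, shared))
    with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
        print("different /tmp:", fs_handshake(a, b))
```

When both sides use the same directory, the handshake returns the uid of the file's creator; with two separate directories the lstat() finds nothing and the handshake returns None, which is the situation between condormaster and hypatia2.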

This apparently means FS authentication should be dropped?

> 
> Hope the above helps,
> regards
> Todd

Thanks, there are lots of details to check, and many things to think about again!

Best,
- Steffen


-- 
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~