Re: [HTCondor-users] Inconsistent output of "condor_q -glo"?
- Date: Mon, 22 Nov 2021 10:34:23 +0100
- From: Steffen Grunewald <steffen.grunewald@xxxxxxxxxx>
- Subject: Re: [HTCondor-users] Inconsistent output of "condor_q -glo"?
On Fri, 2021-11-19 at 17:27:49 -0600, Todd Tannenbaum wrote:
> On 11/18/2021 2:50 AM, Steffen Grunewald wrote:
> > Good morning,
> Hi Steffen,
>
> Some ideas inline below...
>
> > after a major reconfig of our Hypatia cluster, with a couple of jobs having
> > been held before, I'm now getting somewhat inconsistent output from condor_q:
> >
> > root@condormaster:.# condor_status -schedd
> > Name Machine RunningJobs IdleJobs HeldJobs
> >
> > hypatia1.hypatia.local@xxxxxxxxxxxxxxxxxx hypatia1.my.domain 0 0 0
> > hypatia2.hypatia.local@xxxxxxxxxxxxxxxxxx hypatia2.my.domain 0 0 183
> > hypatia3.hypatia.local@xxxxxxxxxxxxxxxxxx hypatia3.my.domain 0 0 0
> >
> > TotalRunningJobs TotalIdleJobs TotalHeldJobs
> >
> > Total 0 0 183
> > root@condormaster:.# condor_q -schedd hypatia1.my.domain
> > All queues are empty
> > root@condormaster:.# condor_q -schedd hypatia2.my.domain
> > All queues are empty
> > root@condormaster:.# condor_q -schedd hypatia3.my.domain
> > All queues are empty
>
> For the above commands, does the following work:
>
> condor_q -allusers -name hypatia2.my.domain
>
> ?
This only works with the "long" name "hypatia1.hypatia.local@xxxxxxxxxxxxxxxxxx".
>
> Note the use of "-name" instead of "-schedd" .... I think you wanted -name here.
>
> Also by default, the schedd will only show the jobs owned by the user making
> the query. Adding "-allusers" will give information for all users,
> regardless of who issued the condor_q command.
>
>
> > (same if I use "hypatia*.hypatia.local")
> >
> > root@condormaster:.# condor_q -glo
> >
> > -- Failed to fetch ads from: <10.150.100.102:4597?addrs=10.150.100.102-4597&alias=hypatia2.my.domain> : hypatia2.my.domain
> > AUTHENTICATE:1003:Failed to authenticate with any method
> > AUTHENTICATE:1004:Failed to authenticate using FS
> > root@condormaster:.#
> >
> > I have compared the output of "condor_config_val -dump" for hypatia1 and hypatia2,
> > and see no difference (except the few machine-/IP-specific lines).
> > What's behind those AUTHENTICATE:100{3,4} failures?
>
>
> So instead of the above, does the following command work:
>
> condor_q -global -allusers
>
> ?
Indeed, this works, without a complaint.
> If adding "-allusers" works, here is an explanation: the schedd will
> only show jobs owned by the user who issued condor_q. To do this, the
> schedd needs to know who issued the condor_q command via authentication, and
> the error is likely a result of no authentication method that works over the
> network being configured. By adding "-allusers", the schedd does not need to
> know who issued the command, and can just return all the jobs (assuming the
> host has READ authorization).
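The behaviour described above can be pictured with a small Python model. This is only an illustrative sketch, not HTCondor's actual code: the function name `query_jobs`, its parameters, and the job dictionaries are hypothetical. It shows why an owner-filtered query needs an authenticated identity while an "-allusers" style query does not.

```python
def query_jobs(jobs, all_users, authenticated_user=None):
    """Toy model of a queue query (hypothetical, not HTCondor's implementation).

    jobs               -- list of dicts, each with an "Owner" key
    all_users          -- True models "condor_q -allusers"
    authenticated_user -- identity established by authentication, or None
    """
    if all_users:
        # No owner filter requested, so the identity of the caller is not needed.
        return list(jobs)
    if authenticated_user is None:
        # Filtering by owner is impossible if nobody could be authenticated.
        raise PermissionError("cannot filter by owner: client identity unknown")
    return [job for job in jobs if job["Owner"] == authenticated_user]


queue = [{"Owner": "alice", "Id": "1.0"}, {"Owner": "bob", "Id": "2.0"}]
everything = query_jobs(queue, all_users=True)
only_alice = query_jobs(queue, all_users=False, authenticated_user="alice")
```

In this model, calling `query_jobs(queue, all_users=False)` without an identity raises an error, which loosely corresponds to the AUTHENTICATE failures reported by "condor_q -glo" above.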
I'll look into the authorization setup details. I took a rather naive path
when upgrading from 8.8 to 9.0, and the devil seemingly is in some detail.
Is it correct that this error was only reported *because* there were jobs
held on this schedd, not on the others?
> If you want "-allusers" to be the default whenever a condor_q command is
> issued, you can add the following to the condor_config:
>
> CONDOR_Q_ONLY_MY_JOBS = False
OK; perhaps I was wrong when I thought that root would be handled differently?
> Another way to handle this would be to allow READ access using CLAIMTOBE
> authentication. CLAIMTOBE is not secure (the client can claim to be
> anybody), but the idea here is to only allow it for READ operations. This
> would allow users to issue a condor_q command from a remote machine and
> still see only their jobs.
>
> > In the ScheddLog, I see
> >
> > DC_AUTHENTICATE: reason for authentication failure: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using FS|FS:1004:Unable to lstat(/tmp/FS_XXXvkEMCP)
> >
> > Since /tmp has permissions 1777, what causes the lstat() error?
> >
>
> You are issuing "condor_q" on machine A, and it is trying to talk to a
> schedd on machine B. The schedd is trying to authenticate the person who
> issued the "condor_q" command as explained above (unless you use
> -allusers). The "FS" authentication method (for FileSystem authentication)
> works as follows: the schedd asks the client to create a file in /tmp, and
> then the schedd does an lstat() on the file to read the file ownership and
> thus authenticate the identity of the person issuing condor_q. This lstat()
> failed because /tmp is not shared between machine A and machine B, and thus
> the schedd is unable to lstat() the file it asked condor_q to create because
> the file is not there.
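The FS handshake described above can be sketched in Python. This is a simplified illustration under stated assumptions, not HTCondor's implementation: the helper names are hypothetical, and two separate temporary directories stand in for the /tmp directories of machine A and machine B.

```python
import os
import tempfile


def client_create_proof(directory):
    """Client side: create a file in the directory the server asked for."""
    fd, path = tempfile.mkstemp(prefix="FS_", dir=directory)
    os.close(fd)
    return path


def server_check_owner(path):
    """Server side: lstat() the file and return the owner's uid.

    Returns None if the file is not visible to the server.
    """
    try:
        return os.lstat(path).st_uid
    except FileNotFoundError:
        return None


def demo():
    # tmp_a and tmp_b model the unshared /tmp of two different machines.
    with tempfile.TemporaryDirectory() as tmp_a, tempfile.TemporaryDirectory() as tmp_b:
        proof = client_create_proof(tmp_a)
        # Same machine: the server sees the file and reads its owner.
        same_host_uid = server_check_owner(proof)
        # Different machine: the same file name does not exist in the server's /tmp.
        remote_uid = server_check_owner(os.path.join(tmp_b, os.path.basename(proof)))
        return same_host_uid, remote_uid


same_host_uid, remote_uid = demo()
```

On a single host the server obtains the uid of the user who ran the client. In the two-machine case the lookup finds nothing, which is the situation behind the "Unable to lstat(/tmp/FS_XXXvkEMCP)" message in the ScheddLog.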
This apparently means FS authentication should be dropped?
>
> Hope the above helps,
> regards
> Todd
Thanks, there are lots of details to check, and many things to think about again!
Best,
- Steffen
--
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~