
Re: [HTCondor-users] help needed to troubleshoot a "SECMAN: FAILED" issue



Hi Zach and Mark,

thanks a lot for your help.
Let's see if I can answer the questions and add more info.

[1] The error messages in the SchedLog are there all the time, not only
when I try condor_q.

05/29/20 04:40:56 (pid:11841) SECMAN: FAILED: Received "DENIED" from
server for user condor_pool@<a_domain_name> using method PASSWORD.
05/29/20 04:40:56 (pid:11841) Failed to send RESCHEDULE to negotiator
NEGOTIATOR: SECMAN:2010:Received "DENIED" from server for user
condor_pool@<a_domain_name> using method PASSWORD.

[2] When I try condor_q remotely, it does not matter whether I run it as
root or as an unprivileged user, I get the same result:

$ condor_q -name <schedd_hostname>

-- Failed to fetch ads from:
<IP:9618?addrs=IP-9618&noUDP&sock=11649_d2d4_3> : <hostname>
CEDAR:6001:Failed to connect to <IP:9618?addrs=IP-9618&noUDP&sock=11649_d2d4_3>

[3] The schedd is indeed in the output of condor_status -schedd.

[4] On the testing schedd

# condor_config_val ALLOW_READ
*/*.<a_domain_name>, */*.<the_schedd_domain_name>
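For reference, since the SchedLog suggests remote clients end up authenticated
as condor_pool@<a_domain_name>, I guess ALLOW_READ would also need to admit
that mapped identity. A sketch of what that might look like (placeholders
only, not tested):

ALLOW_READ = */*.<a_domain_name>, */*.<the_schedd_domain_name>, condor_pool@<a_domain_name>/*

Here the trailing /* means that authenticated user is accepted from any host.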

[5] If I run condor_config_val ALLOW_WRITE_COLLECTOR on the central
manager, the only line in the output is <central_manager_hostname>
<central_manager_ip>

[6] condor_config_val ALLOW_DAEMON on the schedd gives me

condor_pool@<a_domain_name>/*.<a_domain_name>, <schedd_hostname>

[7] condor_config_val ALLOW_DAEMON on the central manager gives me

condor_pool@<a_domain_name>/*.<a_domain_name>,
<central_manager_hostname>, submit-side@matchsession
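Since my test schedd's host is outside *.<a_domain_name>, I suspect the host
part of that first entry is what fails for it. A sketch of an extended
ALLOW_DAEMON that would also admit the pool identity coming from the test
schedd's domain (placeholders only, not tested):

ALLOW_DAEMON = condor_pool@<a_domain_name>/*.<a_domain_name>, \
  condor_pool@<a_domain_name>/*.<the_schedd_domain_name>, \
  <central_manager_hostname>, submit-side@matchsession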

[8] The pool_password file is indeed a copy from one of the production
schedds, so it is the same.

[9] I reserved a startd to test this schedd.
On that startd I have

# condor_config_val STARTD_ATTRS
 RalCluster, RalSnapshot, RalBranchName, RalBranchType, ScalingFactor,
StartJobs, ShouldHibernate, PREEMPTABLE_ONLY, StartJobs,
EFFICIENT_DRAIN, KILL_SIGNAL, ONLY_ARCCE6
# condor_config_val ONLY_ARCCE6
True
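(For context: ONLY_ARCCE6 is just a custom startd attribute. Given the two
outputs above, on the startd it is defined and advertised roughly like this
sketch:

ONLY_ARCCE6 = True
STARTD_ATTRS = $(STARTD_ATTRS) ONLY_ARCCE6
)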

and at the schedd I have this JOB_TRANSFORM rule

[
  ...
  set_Requirements = ( TARGET.ONLY_ARCCE6 );
  ...
]
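(For completeness, the full knob on the schedd looks roughly like the sketch
below; the transform name OnlyArcCE6 is made up here, and the elided
attributes are unchanged:

JOB_TRANSFORM_NAMES = $(JOB_TRANSFORM_NAMES) OnlyArcCE6
JOB_TRANSFORM_OnlyArcCE6 @=end
[
  ...
  set_Requirements = ( TARGET.ONLY_ARCCE6 );
  ...
]
@end
)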

However, when I submitted a job from the testing schedd, it never ran.
The output of condor_q -better-analyze looks like this:


The Requirements expression for job 3.000 is like this:

    ( TARGET.ONLY_ARCCE6 )

Job 3.000 defines the following attributes:
The Requirements expression for job 3.000 reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[0]          25  TARGET.ONLY_ARCCE6

003.000:  Job has not yet been considered by the matchmaker.
    <-------- ???

003.000:  Run analysis summary ignoring user priority.  Of 13655 machines,
  13068 are rejected by your job's requirements
      0 reject your job because of their own requirements
    562 are exhausted partitionable slots
      0 match and are already running your jobs
     24 match but are serving other users
      1 are available to run your job                  <----- !!!


[10] From a remote host, as an unprivileged user:

>>> import classad
>>> import htcondor
>>> coll = htcondor.Collector()
>>> results = coll.query(htcondor.AdTypes.Schedd, "true", ["Name"])
>>> print results
    # returns all schedds, including the testing one at the end of the list
>>> scheddAd = coll.locate(htcondor.DaemonTypes.Schedd, results[-1]["Name"])
>>> schedd = htcondor.Schedd(scheddAd)
>>> schedd.query("", ["JobStatus"])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: Failed to fetch ads from schedd.


That is all the info I have at this time.
Thanks a lot for your help.
Cheers,
Jose


On Fri, May 29, 2020 at 4:52, Zach Miller (<zmiller@xxxxxxxxxxx>) wrote:
>
> Hi Jose,
>
> Sorry, I meant to respond to this earlier but Mark beat me to it so I just wanted to ask a couple questions and add a couple additions to Mark's email:
>
> It looks like you are running condor_q as root perhaps?  It (condor_q) seems to be using the pool password method of authentication and mapping you to the user "condor_pool@<domain>".  That's perhaps okay, depending on the ALLOW_* settings, but it's kind of unexpected.
>
> You mentioned two things -- adding a SchedD to a pool, and then running condor_q.  It sounds like you got the SchedD added just fine.  Is that correct? (Is it showing up when you run condor_status -schedd?)
>
> For the issue with condor_q you want to check the ALLOW_* settings on the SchedD machine.  Running condor_q requires READ-level authorization so you should run: "condor_config_val ALLOW_READ".  For extra-nerdy goodness, use this regular expression:
>         # This will show you the allow settings for all subsystems and authorization levels
>         condor_config_val -dump '^(.*\.)?ALLOW'
>
> Basically, you need an entry in the ALLOW list that authorizes the user running condor_q, whoever that may be.  You may not care and just want to set "ALLOW_READ = *".  But let us know what results you get and we'll try to help!
>
>
> Cheers,
> -zach
>
> On 5/28/20, 5:58 PM, "HTCondor-users on behalf of coatsworth@xxxxxxxxxxx" <htcondor-users-bounces@xxxxxxxxxxx on behalf of coatsworth@xxxxxxxxxxx> wrote:
>
>     Hi Jose, here are a few quick troubleshooting questions.
>     On your central manager machine, can you send the result of a `condor_config_val ALLOW_WRITE_COLLECTOR` command? Do you see your new schedd domain in this list?
>
>     On both your schedd and central manager machines, can you send the result of a `condor_config_val ALLOW_DAEMON`? Again, do you see your new schedd domain in both of those lists?
>
>     Can you confirm the pool password you set on the new schedd is the same as other schedds? If you're not sure, try copying the pool password file directly from another schedd that works.
>
>     Lastly I'm pretty sure that <a_domain_name> should be your schedd domain, not the production infrastructure.
>
>     Please give these a try and let me know, if they don't reveal the problem then I'll ask our security experts to weigh in :)
>
>     Mark
>
>
>
>
>
>
>     On Thu, May 28, 2020 at 9:29 AM <jcaballero.hep@xxxxxxxxx> wrote:
>
>
>     On Thu, May 28, 2020 at 15:17, Jose Caballero
>     (<jcaballero.hep@xxxxxxxxx>) wrote:
>     >
>     > Hi,
>     >
>     > I need some guidance here.
>     >
>     > I am trying to setup a testing Schedd and add it to an existing pool.
>     > It has the same configuration as the other Schedds in production.
>     > However, there is a difference: my testing Schedd is on a host with a
>     > different domain name than the rest of the infrastructure. I feel that
>     > is part of the problem here.
>     >
>     > When I try to run condor_q remotely against the new test schedd, I get
>     > this in the SchedLog
>     >
>     > SECMAN: FAILED: Received "DENIED" from server for user
>     > condor_pool@<a_domain_name> using method PASSWORD.
>     >
>     > where the <a_domain_name> is the domain name of the production
>     > infrastructure, not the domain name of this testing schedd.
>     > Is that a problem?
>     >
>     > Extra info, let me know if there is something else I need to provide:
>     >
>     > ======================================
>     > # condor_config_val SEC_PASSWORD_FILE
>     > /etc/condor/pool_password
>     >
>     > # ls -l /etc/condor/pool_password
>     > -r-------- 1 root root 256 May 28 13:22 /etc/condor/pool_password
>     >
>     > # rpm -qa | grep condor
>     > condor-std-universe-8.6.13-1.el7.x86_64
>     > condor-8.6.13-1.el7.x86_64
>     > condor-procd-8.6.13-1.el7.x86_64
>     > condor-externals-8.6.13-1.el7.x86_64
>     > condor-external-libs-8.6.13-1.el7.x86_64
>     > condor-kbdd-8.6.13-1.el7.x86_64
>     > condor-cream-gahp-8.6.13-1.el7.x86_64
>     > condor-python-8.6.13-1.el7.x86_64
>     > condor-all-8.6.13-1.el7.x86_64
>     > condor-vm-gahp-8.6.13-1.el7.x86_64
>     > condor-bosco-8.6.13-1.el7.x86_64
>     > condor-classads-8.6.13-1.el7.x86_64
>     > ======================================
>     >
>     > Thanks a lot in advance.
>     > Cheers,
>     > Jose
>
>     An extra piece of info.
>     From the NegotiatorLog, replacing again real values by <foo>:
>
>     ======================================
>     05/28/20 15:06:39 PERMISSION DENIED to condor_pool@<a_domain_name>
>     from host <the_schedd_ip> for command 421 (Reschedule), access level
>     DAEMON: reason: cached result for DAEMON; see first case for the full
>     reason
>     05/28/20 15:06:39 DC_AUTHENTICATE: Command not authorized, done!
>     ======================================
>
>     _______________________________________________
>     HTCondor-users mailing list
>     To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
>     subject: Unsubscribe
>     You can also unsubscribe by visiting
>     https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>
>     The archives can be found at:
>     https://lists.cs.wisc.edu/archive/htcondor-users/
>
>
>
>
>     --
>     Mark Coatsworth
>     Systems Programmer
>     Center for High Throughput Computing
>     Department of Computer Sciences
>     University of Wisconsin-Madison
>
>