
Re: [HTCondor-users] help needed to troubleshoot a "SECMAN: FAILED" issue



Hello,

Thanks for the answers.  I had partially misunderstood what was happening before, but it is starting to make more sense now.

FIRST ISSUE:  Adding the SchedD into the pool.  Looks like this worked based on results of [3] below.


SECOND ISSUE: The negotiator is not authorizing the "RESCHEDULE" message from the SchedD.  This is annoying but shouldn't actually stop the jobs from running.  Note that the "PASSWORD" authentication itself is succeeding, so that's good; it is the authorization step afterwards that is failing.  On your negotiator machine you will want to look at the authorization settings.  Use my fancy regex from the previous email:
	condor_config_val -dump '^(.*\.)?ALLOW'

You want to double-check "ALLOW_DAEMON".  That list needs to include an entry that allows your new SchedD.  Strangely, it's advertising to the collector just fine, so I would have suspected this was already the case.  Do you maybe have special settings just for the negotiator?  Are you setting ALLOW_ADVERTISE_SCHEDD differently from ALLOW_DAEMON?
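
For example, to check whether the negotiator has a subsystem-specific override of those knobs (both names are standard HTCondor config knobs; whether they are set locally is the question):
	condor_config_val NEGOTIATOR.ALLOW_DAEMON
	condor_config_val ALLOW_ADVERTISE_SCHEDD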

A quick and easy fix is to set ALLOW_DAEMON to "condor_pool@*", which means "anyone with the current pool password is authorized".  Or you can be a little more restrictive and add two entries: "condor_pool@*/*.<collector's_UID_DOMAIN>" and "condor_pool@*/<schedd.ip.or.hostname>".
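
In config-file terms, that more restrictive option might look something like this on the negotiator's machine (a sketch; the bracketed names are placeholders for your real values):
	ALLOW_DAEMON = $(ALLOW_DAEMON), condor_pool@*/*.<collector's_UID_DOMAIN>, condor_pool@*/<schedd.ip.or.hostname>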


THIRD ISSUE: Running condor_q remotely

I can't discern from the error message if it is an error (A) talking to the collector to find the schedd or (B) talking to the schedd to get the jobs.  I would suspect that talking to the collector was already working, so I'm guessing it's (B).

Can you run "condor_q -name <new.schedd>" again?  Set the environment variable "_condor_TOOL_DEBUG = D_ALL:2" and then pass the "-debug" flag to condor_q.  This should give me a pretty good idea of what's going on.  Thanks!
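
In a bash-like shell, that combination would look something like this (the schedd name is a placeholder):
	_condor_TOOL_DEBUG=D_ALL:2 condor_q -debug -name <new.schedd>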

Double-check the ALLOW_READ settings on both the collector and on your new SchedD.

Let us know how it goes.  Thanks!


Cheers,
-zach


On 5/29/20, 2:57 AM, "HTCondor-users on behalf of jcaballero.hep@xxxxxxxxx" <htcondor-users-bounces@xxxxxxxxxxx on behalf of jcaballero.hep@xxxxxxxxx> wrote:

    Hi Zach and Mark,

    thanks a lot for your help.
    Let's see if I can answer the questions and add more info.

    [1] The error messages in ScheddLog are there all the time, not only
    when I try condor_q.

    05/29/20 04:40:56 (pid:11841) SECMAN: FAILED: Received "DENIED" from
    server for user condor_pool@<a_domain_name> using method PASSWORD.
    05/29/20 04:40:56 (pid:11841) Failed to send RESCHEDULE to negotiator
    NEGOTIATOR: SECMAN:2010:Received "DENIED" from server for user
    condor_pool@<a_domain_name> using method PASSWORD.

    [2] When I try condor_q remotely, it does not matter whether I run it
    as root or as an unprivileged user; I get the same result:

    $ condor_q -name <schedd_hostname>

    -- Failed to fetch ads from:
    <IP:9618?addrs=IP-9618&noUDP&sock=11649_d2d4_3> : <hostname>
    CEDAR:6001:Failed to connect to <IP:9618?addrs=IP-9618&noUDP&sock=11649_d2d4_3>

    [3] The schedd is indeed in the output of condor_status -schedd.

    [4] On the testing schedd

    # condor_config_val ALLOW_READ
    */*.<a_domain_name>, */*.<the_schedd_domain_name>

    [5] If I run condor_config_val ALLOW_WRITE_COLLECTOR on the central
    manager, the only line in the output is <central_manager_hostname>
    <central_manager_ip>

    [6] condor_config_val ALLOW_DAEMON on the schedd gives me

    condor_pool@<a_domain_name>/*.<a_domain_name>, <schedd_hostname>

    [7] condor_config_val ALLOW_DAEMON on the central manager gives me

    condor_pool@<a_domain_name>/*.<a_domain_name>,
    <central_manager_hostname>, submit-side@matchsession

    [8] The pool_password file is indeed a copy from one of the production
    schedds, so it is the same.

    [9] I reserved a startd to test this schedd.
    On that startd I have

    # condor_config_val STARTD_ATTRS
     RalCluster, RalSnapshot, RalBranchName, RalBranchType, ScalingFactor,
    StartJobs, ShouldHibernate, PREEMPTABLE_ONLY, StartJobs,
    EFFICIENT_DRAIN, KILL_SIGNAL, ONLY_ARCCE6
    # condor_config_val ONLY_ARCCE6
    True

    and at the schedd I have this JOB_TRANSFORMATION rule

    [
      ...
      set_Requirements = ( TARGET.ONLY_ARCCE6 );
      ...
    ]

    However, when I submitted a job from the testing schedd, it never ran.
    The output of condor_q -better-analyze looks like this:


    The Requirements expression for job 3.000 is like this:

        ( TARGET.ONLY_ARCCE6 )

    Job 3.000 defines the following attributes:
    The Requirements expression for job 3.000 reduces to these conditions:

             Slots
    Step    Matched  Condition
    -----  --------  ---------
    [0]          25  TARGET.ONLY_ARCCE6

    003.000:  Job has not yet been considered by the matchmaker.
        <-------- ???

    003.000:  Run analysis summary ignoring user priority.  Of 13655 machines,
      13068 are rejected by your job's requirements
          0 reject your job because of their own requirements
        562 are exhausted partitionable slots
          0 match and are already running your jobs
         24 match but are serving other users
          1 are available to run your job                  <----- !!!


    [10] from remote, as unprivileged user:

    >>> import classad
    >>> import htcondor
    >>> coll = htcondor.Collector()
    >>> results = coll.query(htcondor.AdTypes.Schedd, "true", ["Name"])
    >>> print results
        # returns all schedds, including the testing one at the end of the list
    >>> scheddAd = coll.locate(htcondor.DaemonTypes.Schedd, results[-1]["Name"])
    >>> schedd = htcondor.Schedd(scheddAd)
    >>> schedd.query("", ["JobStatus"])
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    IOError: Failed to fetch ads from schedd.


    That is all the info I have at this time.
    Thanks a lot for your help.
    Cheers,
    Jose


    On Fri, May 29, 2020 at 4:52, Zach Miller (<zmiller@xxxxxxxxxxx>) wrote:
    >
    > Hi Jose,
    >
    > Sorry, I meant to respond to this earlier but Mark beat me to it so I just wanted to ask a couple questions and add a couple additions to Mark's email:
    >
    > It looks like you are running condor_q as root perhaps?  It (condor_q) seems to be using the pool password method of authentication and mapping you to the user "condor_pool@<domain>".  That's perhaps okay, depending on the ALLOW_* settings, but it's kind of unexpected.
    >
    > You mentioned two things -- adding a SchedD to a pool, and then running condor_q.  It sounds like you got the SchedD added just fine.  Is that correct? (Is it showing up when you run condor_status -schedd?)
    >
    > For the issue with condor_q you want to check the ALLOW_* settings on the SchedD machine.  Running condor_q requires READ-level authorization so you should run: "condor_config_val ALLOW_READ".  For extra-nerdy goodness, use this regular expression:
    >         # This will show you the allow settings for all subsystems and authorization levels
    >         condor_config_val -dump '^(.*\.)?ALLOW'
    >
    > Basically, you need an entry in the ALLOW list that authorizes the user running condor_q, whoever that may be.  You may not care and just want to set "ALLOW_READ = *".  But let us know what results you get and we'll try to help!
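    >
    > For example, a more targeted version might look like this in the SchedD's config (a sketch; the domain is a placeholder, and condor_pool@* would cover clients that authenticate with the pool password):
    >         ALLOW_READ = */*.<your_domain>, condor_pool@*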
    >
    >
    > Cheers,
    > -zach
    >
    > On 5/28/20, 5:58 PM, "HTCondor-users on behalf of coatsworth@xxxxxxxxxxx" <htcondor-users-bounces@xxxxxxxxxxx on behalf of coatsworth@xxxxxxxxxxx> wrote:
    >
    >     Hi Jose, here are a few quick troubleshooting questions.
    >     On your central manager machine, can you send the result of a `condor_config_val ALLOW_WRITE_COLLECTOR` command? Do you see your new schedd domain in this list?
    >
    >     On both your schedd and central manager machines, can you send the result of a `condor_config_val ALLOW_DAEMON`? Again, do you see your new schedd domain in both of those lists?
    >
    >     Can you confirm the pool password you set on the new schedd is the same as other schedds? If you're not sure, try copying the pool password file directly from another schedd that works.
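    >
    >     If you'd rather regenerate it than copy it, something like this should work (it prompts for the password; use whatever path your SEC_PASSWORD_FILE points to):
    >         condor_store_cred -f /etc/condor/pool_password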
    >
    >     Lastly, I'm pretty sure that <a_domain_name> should be your schedd's domain, not the production infrastructure's.
    >
    >     Please give these a try and let me know; if they don't reveal the problem, I'll ask our security experts to weigh in :)
    >
    >     Mark
    >
    >     On Thu, May 28, 2020 at 9:29 AM <jcaballero.hep@xxxxxxxxx> wrote:
    >
    >
    >     On Thu, May 28, 2020 at 15:17, Jose Caballero
    >     (<jcaballero.hep@xxxxxxxxx>) wrote:
    >     >
    >     > Hi,
    >     >
    >     > I need some guidance here.
    >     >
    >     > I am trying to set up a testing SchedD and add it to an existing pool.
    >     > It has the same configuration as the other SchedDs in production.
    >     > However, there is a difference: my testing SchedD is on a host with a
    >     > different domain name than the rest of the infrastructure. I feel that
    >     > is part of the problem here.
    >     >
    >     > When I try to run condor_q remotely against the new test schedd, I get
    >     > this in the SchedLog
    >     >
    >     > SECMAN: FAILED: Received "DENIED" from server for user
    >     > condor_pool@<a_domain_name> using method PASSWORD.
    >     >
    >     > where the <a_domain_name> is the domain name of the production
    >     > infrastructure, not the domain name of this testing schedd.
    >     > Is that a problem?
    >     >
    >     > Extra info, let me know if there is something else I need to provide:
    >     >
    >     > ======================================
    >     > # condor_config_val SEC_PASSWORD_FILE
    >     > /etc/condor/pool_password
    >     >
    >     > # ls -l /etc/condor/pool_password
    >     > -r-------- 1 root root 256 May 28 13:22 /etc/condor/pool_password
    >     >
    >     > # rpm -qa | grep condor
    >     > condor-std-universe-8.6.13-1.el7.x86_64
    >     > condor-8.6.13-1.el7.x86_64
    >     > condor-procd-8.6.13-1.el7.x86_64
    >     > condor-externals-8.6.13-1.el7.x86_64
    >     > condor-external-libs-8.6.13-1.el7.x86_64
    >     > condor-kbdd-8.6.13-1.el7.x86_64
    >     > condor-cream-gahp-8.6.13-1.el7.x86_64
    >     > condor-python-8.6.13-1.el7.x86_64
    >     > condor-all-8.6.13-1.el7.x86_64
    >     > condor-vm-gahp-8.6.13-1.el7.x86_64
    >     > condor-bosco-8.6.13-1.el7.x86_64
    >     > condor-classads-8.6.13-1.el7.x86_64
    >     > ======================================
    >     >
    >     > Thanks a lot in advance.
    >     > Cheers,
    >     > Jose
    >
    >     An extra piece of info.
    >     From the NegotiatorLog, again replacing real values with <foo>:
    >
    >     ======================================
    >     05/28/20 15:06:39 PERMISSION DENIED to condor_pool@<a_domain_name>
    >     from host <the_schedd_ip> for command 421 (Reschedule), access level
    >     DAEMON: reason: cached result for DAEMON; see first case for the full
    >     reason
    >     05/28/20 15:06:39 DC_AUTHENTICATE: Command not authorized, done!
    >     ======================================
    >
    >
    >     --
    >     Mark Coatsworth
    >     Systems Programmer
    >     Center for High Throughput Computing
    >     Department of Computer Sciences
    >     University of Wisconsin-Madison
    >
    >

    _______________________________________________
    HTCondor-users mailing list
    To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
    subject: Unsubscribe
    You can also unsubscribe by visiting
    https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

    The archives can be found at:
    https://lists.cs.wisc.edu/archive/htcondor-users/