
Re: [HTCondor-users] condor_off -peaceful -daemon master permissions check fail (BUG?)



On Thu, Jun 13, 2013 at 03:09:30PM +0200, Joan J. Piles wrote:
> Hi Brian,
> 
> What I wonder is whether sending a first condor_off command to the startd and
> then another one to the master would work, since once the DC_OFF_PEACEFUL
> has already arrived at the startd, the later SIGTERM from the master should be
> mostly ignored... shouldn't it?


This should in fact work in your case:
	condor_off -peaceful -startd REMOTEHOST
	condor_off -peaceful -master REMOTEHOST
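
For anyone who wants to script that ordering, here is a minimal sketch of the
two steps above as a shell function. The function name peaceful_off and the
CONDOR_OFF override variable are my own inventions for illustration (the
override only exists so the sequence can be dry-run with echo); only the two
condor_off invocations come from this thread. Note that a zero exit status
from the first command only means the request was sent, not that the startd
honored it, so checking the StartLog is still a good idea.

```shell
#!/bin/sh
# Sketch only: peaceful shutdown of one host, startd first, then master.
# CONDOR_OFF is a hypothetical override for dry runs, e.g.
#   CONDOR_OFF="echo condor_off"
CONDOR_OFF="${CONDOR_OFF:-condor_off}"

peaceful_off() {
    host="$1"
    # Step 1: send the peaceful request straight to the startd, so it
    # arrives from the admin machine rather than being relayed by the master.
    $CONDOR_OFF -peaceful -startd "$host" || return 1
    # Step 2: only if step 1 could be sent, ask the master to shut down.
    $CONDOR_OFF -peaceful -master "$host"
}
```

Usage would be something like `peaceful_off REMOTEHOST`, or
`CONDOR_OFF="echo condor_off"` first to see what would be run.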


However, as Brian mentioned, condor_off has other problems.  One that I was
debugging earlier this week: if REMOTEHOST is not an actual DNS name, then:
	condor_off -peaceful -startd STARTDNAME

doesn't work.  You can then use:
	condor_off -peaceful -startd -addr "<rem.ote.ho.st:port>"
	condor_off -peaceful -master -addr "<rem.ote.ho.st:port>"

to communicate directly with the daemons, but that is quite clunky.
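
A rough sketch of the -addr variant, in case it helps. The angle brackets are
part of the address string, so it has to be quoted or the shell will treat
< and > as redirections. The function name peaceful_off_addr and the
CONDOR_OFF dry-run override are hypothetical, and I am assuming you pass the
startd and master addresses separately, since the two daemons may not be
listening on the same port.

```shell
#!/bin/sh
# Sketch only: peaceful shutdown by daemon address instead of host name.
# CONDOR_OFF is a hypothetical override for dry runs, e.g.
#   CONDOR_OFF="echo condor_off"
CONDOR_OFF="${CONDOR_OFF:-condor_off}"

peaceful_off_addr() {
    startd_addr="$1"   # address of the startd, in "<ip:port>" form
    master_addr="$2"   # address of the master, in "<ip:port>" form
    # Quote the addresses so "<" and ">" reach condor_off literally.
    $CONDOR_OFF -peaceful -startd -addr "$startd_addr" || return 1
    $CONDOR_OFF -peaceful -master -addr "$master_addr"
}
```

Called as `peaceful_off_addr "<rem.ote.ho.st:port>" "<rem.ote.ho.st:port>"`,
with the real addresses filled in.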


Our proposed solution is partially mentioned in that ticket:
	http://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=3686

which is to essentially do what you suggested, and only ever communicate with
the condor_master on a given machine, and change the authorization levels so
that the master can inform its children about the way in which they're being
shut down.


This is a big enough change that we're planning it for the development series,
and not stable.  Currently it's targeted for 8.1.0.  I'll also add your notes
into the ticket.


Cheers,
-zach



> Joan
> 
> On 13/06/13 14:22, Brian Bockelman wrote:
> 
>     Hi Joan,
> 
>     A peaceful-off is a combination of two signals - 
>     a) A signal over the DaemonCore socket.  (this is not an RPC, so the master
>     does not know if it succeeded)
>     b) A traditional unix signal.
> 
>     As you saw, if HTCondor permissions are incorrect, it's possible (and
>     locally, we've done this about a dozen times) to have (b) go through while
>     (a) is ignored.  This results in a graceful off when a peaceful one was
>     intended.
> 
>     There are other issues with condor_off.  See:
>     https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=3686.
> 
>     After being burned a few too many times, we either:
>     (a) SSH into each node individually and send the command locally, or
>     (b) Find a working combination via trial & error with a few worker nodes,
>     then drain off the cluster.
> 
>     I tend to grumble about condor_off constantly -- I'm hoping it'll get
>     redesigned in the next year or so to be more robust.
> 
>     Brian
> 
>     On Jun 13, 2013, at 6:25 AM, "Joan J. Piles" <jpiles@xxxxxxxxx> wrote:
> 
> 
>         Hi all,
> 
>         I don't know if this is a bug (I think it is), but there is a problem
>         when you try to do a condor_off -peaceful -daemon master node from a
>         central management machine.
> 
>         When the condor master gets the peaceful shutdown command, it gets it
>         from an authorized (as ADMINISTRATOR) machine. However, when it is to
>         propagate this command to the children daemons, it does so as the local
>         machine, which is not in the HOSTALLOW_ADMINISTRATOR list. We can see
>         it in the log (172.16.4.103 is our management node, and 172.16.6.2 our
>         test node):
> 
>         MasterLog (trimmed, only relevant lines):
> 
> 
>             06/13/13 13:14:08 Received TCP command 60015 (DC_OFF_PEACEFUL) from
>             unauthenticated@unmapped <172.16.4.103:46020>, access level
>             ADMINISTRATOR
>             06/13/13 13:14:08 Calling HandleReq <handle_off_peaceful()> (0) for
>             command 60015 (DC_OFF_PEACEFUL) from unauthenticated@unmapped
>             <172.16.4.103:46020>
>             06/13/13 13:14:08 Got SIGTERM. Performing graceful shutdown.
>             06/13/13 13:14:08 Completed DC_SET_PEACEFUL_SHUTDOWN to local
>             startd
>             06/13/13 13:14:14 Sent SIGTERM to STARTD (pid 31817)
>             06/13/13 13:14:14 The STARTD (pid 31817) exited with status 0
>             06/13/13 13:14:15 All daemons are gone.  Exiting.
> 
> 
> 
>         Here, we see that the request comes from an authorized source. However,
>         what the startd sees is subtly different, as the order is seen as
>         coming from the local machine, which is not authorized:
> 
> 
>         StartLog:
> 
> 
>             06/13/13 13:14:08 Calling Handler
>             <DaemonCommandProtocol::WaitForSocketData> (2)
>             06/13/13 13:14:08 PERMISSION DENIED to unauthenticated@unmapped
>             from host 172.16.6.2 for command 60016 (DC_SET_PEACEFUL_SHUTDOWN),
>             access level ADMINISTRATOR: reason: ADMINISTRATOR authorization
>             policy contains no matching ALLOW entry for this request;
>             identifiers used for this host: 172.16.6.2,her06-02.
>             hermes.cps.unizar.es,her06-02, hostname size = 2, original ip
>             address = 172.16.6.2
> 
> 
> 
>         As it later gets the sigterm:
> 
> 
>             06/13/13 13:14:14 Got SIGTERM. Performing graceful shutdown.
>             06/13/13 13:14:14 shutdown graceful
>             06/13/13 13:14:14 All resources are free, exiting.
> 
> 
>         The end result is that we get a graceful shutdown instead of the
>         peaceful one we asked for.
> 
>         An obvious workaround is to change:
> 
> 
>             HOSTALLOW_ADMINISTRATOR = $(CONDOR_HOST)
> 
> 
>         to:
> 
> 
>             HOSTALLOW_ADMINISTRATOR = $(CONDOR_HOST), $(FULL_HOSTNAME)
> 
> 
>         But since it's not the default policy, nor is there a clear reason why
>         this should be so, I think it's more of a bug. condor_master should
>         somehow authenticate as DAEMON, or pass on the credentials to startd.
> 
>         When we do a condor_off -peaceful -daemon startd, however, everything
>         works as expected since the shutdown command comes directly from the
>         management machine.
> 
>         Regards,
> 
>         Joan
> 
> 
> 
>         --
>         --------------------------------------------------------------------------
>         Joan Josep Piles Contreras -  Analista de sistemas
>         I3A - Instituto de Investigación en Ingeniería de Aragón
>         Tel: 876 55 51 47 (ext. 845147)
>         http://i3a.unizar.es -- jpiles@xxxxxxxxx
>         --------------------------------------------------------------------------
> 
>         _______________________________________________
>         HTCondor-users mailing list
>         To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx
>         with a
>         subject: Unsubscribe
>         You can also unsubscribe by visiting
>         https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> 
>         The archives can be found at:
>         https://lists.cs.wisc.edu/archive/htcondor-users/
> 
> 
> 
> 
> 
> 
> 
> 
> 
