
Re: [HTCondor-users] condor_off -peaceful -daemon master permissions check fail (BUG?)



Hi Brian,

What I wonder is whether sending a first condor_off command to the startd and then another one to the master would work. Since the DC_OFF_PEACEFUL has already reached the startd, the second SIGTERM from the master should be mostly ignored... shouldn't it?
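
To make the question concrete, here is a rough sketch of the sequence I have in mind. It only builds the two condor_off invocations (startd first, then master) using the same flags mentioned elsewhere in this thread; the helper name and the node name are made up for illustration, and whether the second step really leaves the peaceful shutdown alone is exactly what I don't know.

```python
def two_step_peaceful_off(node):
    """Build (but do not run) the two condor_off calls: startd first, then master.

    Hypothetical helper for illustration; 'node' is the worker's hostname.
    """
    return [
        ["condor_off", "-peaceful", "-daemon", "startd", node],
        ["condor_off", "-peaceful", "-daemon", "master", node],
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so this is safe to try.
    for cmd in two_step_peaceful_off("her06-02"):
        print(" ".join(cmd))
```

Each list could be passed to subprocess.run(cmd, check=True) from the management machine once the behaviour has been confirmed on a test node.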

Joan

On 13/06/13 14:22, Brian Bockelman wrote:
Hi Joan,

A peaceful-off is a combination of two signals - 
a) A signal over the DaemonCore socket.  (This is not an RPC, so the master does not know whether it succeeded.)
b) A traditional Unix signal.

As you saw, if HTCondor permissions are incorrect, it's possible (and locally, we've done this about a dozen times) to have (b) go through while (a) is ignored.  This results in a graceful off when a peaceful one was intended.
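
A toy model of that failure mode, in case it helps. This is not HTCondor code; the class and function names are invented, and it only captures the idea that step (a) sets a flag if it is authorized while step (b) always arrives.

```python
class ToyStartd:
    """Toy stand-in for a daemon that remembers whether peaceful mode was set."""

    def __init__(self):
        self.peaceful = False

    def on_set_peaceful(self, authorized):
        # Step (a): the DaemonCore command only takes effect if it is authorized.
        if authorized:
            self.peaceful = True
        return authorized

    def on_sigterm(self):
        # Step (b): the Unix signal always arrives; the mode depends on the flag.
        return "peaceful" if self.peaceful else "graceful"


def peaceful_off(startd, authorized):
    # The sender does not look at the result of step (a) before sending step (b).
    startd.on_set_peaceful(authorized)
    return startd.on_sigterm()


if __name__ == "__main__":
    print(peaceful_off(ToyStartd(), authorized=True))
    print(peaceful_off(ToyStartd(), authorized=False))
```

The second call shows the downgrade: the request was for a peaceful off, but only the SIGTERM had any effect.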

There are other issues with condor_off.  See: https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=3686.

After being burned a few too many times, we either:
(a) SSH into each node individually and send the command locally, or
(b) Find a working combination via trial & error with a few worker nodes, then drain off the cluster.

I tend to grumble about condor_off constantly -- I'm hoping it'll get redesigned in the next year or so to be more robust.

Brian

On Jun 13, 2013, at 6:25 AM, "Joan J. Piles" <jpiles@xxxxxxxxx> wrote:

Hi all,

I don't know if this is a bug (I think it is), but there is a problem when you try to do a condor_off -peaceful -daemon master node from a central management machine.

When the condor_master gets the peaceful shutdown command, it gets it from a machine authorized as ADMINISTRATOR. However, when it propagates this command to its child daemons, it does so as the local machine, which is not in the HOSTALLOW_ADMINISTRATOR list. We can see it in the logs (172.16.4.103 is our management node, and 172.16.6.2 our test node):

MasterLog (trimmed, only relevant lines):

06/13/13 13:14:08 Received TCP command 60015 (DC_OFF_PEACEFUL) from unauthenticated@unmapped <172.16.4.103:46020>, access level ADMINISTRATOR
06/13/13 13:14:08 Calling HandleReq <handle_off_peaceful()> (0) for command 60015 (DC_OFF_PEACEFUL) from unauthenticated@unmapped <172.16.4.103:46020>
06/13/13 13:14:08 Got SIGTERM. Performing graceful shutdown.
06/13/13 13:14:08 Completed DC_SET_PEACEFUL_SHUTDOWN to local startd
06/13/13 13:14:14 Sent SIGTERM to STARTD (pid 31817)
06/13/13 13:14:14 The STARTD (pid 31817) exited with status 0
06/13/13 13:14:15 All daemons are gone.  Exiting.


Here, we see that the request comes from an authorized source. However, what the startd sees is subtly different: the command appears to come from the local machine, which is not authorized:


StartLog:

06/13/13 13:14:08 Calling Handler <DaemonCommandProtocol::WaitForSocketData> (2)
06/13/13 13:14:08 PERMISSION DENIED to unauthenticated@unmapped from host 172.16.6.2 for command 60016 (DC_SET_PEACEFUL_SHUTDOWN), access level ADMINISTRATOR: reason: ADMINISTRATOR authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 172.16.6.2,her06-02.hermes.cps.unizar.es,her06-02, hostname size = 2, original ip address = 172.16.6.2


A few seconds later, it gets the SIGTERM:

06/13/13 13:14:14 Got SIGTERM. Performing graceful shutdown.
06/13/13 13:14:14 shutdown graceful
06/13/13 13:14:14 All resources are free, exiting.

The end result is that we get a graceful shutdown instead of the peaceful one we asked for.
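
For anyone wanting to check their own nodes, here is a small sketch that scans StartLog lines for this pattern: a denied DC_SET_PEACEFUL_SHUTDOWN followed by a SIGTERM graceful shutdown. The regular expressions are based only on the log lines quoted above, so they assume the same log format; the function name is made up.

```python
import re

# Patterns taken from the StartLog excerpts quoted in this message.
DENIED = re.compile(
    r"PERMISSION DENIED .* for command \d+ \(DC_SET_PEACEFUL_SHUTDOWN\)"
)
SIGTERM = re.compile(r"Got SIGTERM\. Performing graceful shutdown\.")


def peaceful_downgraded(log_lines):
    """Return True if a denied peaceful request is later followed by a SIGTERM."""
    denied_seen = False
    for line in log_lines:
        if DENIED.search(line):
            denied_seen = True
        elif denied_seen and SIGTERM.search(line):
            return True
    return False


if __name__ == "__main__":
    # Lines abbreviated from the StartLog above.
    sample = [
        "06/13/13 13:14:08 PERMISSION DENIED to unauthenticated@unmapped from host "
        "172.16.6.2 for command 60016 (DC_SET_PEACEFUL_SHUTDOWN), access level ADMINISTRATOR",
        "06/13/13 13:14:14 Got SIGTERM. Performing graceful shutdown.",
    ]
    print(peaceful_downgraded(sample))
```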

An obvious workaround is to change:

HOSTALLOW_ADMINISTRATOR = $(CONDOR_HOST)

to:

HOSTALLOW_ADMINISTRATOR = $(CONDOR_HOST), $(FULL_HOSTNAME)

But since this is not the default policy, and there is no clear reason why it should be required, I think it's more of a bug. condor_master should somehow authenticate as DAEMON, or pass on the credentials to the startd.
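
To illustrate why the extra entry makes a difference, here is a toy exact-match check using the addresses from the logs above. It is only a sketch: the names are invented, I am assuming $(CONDOR_HOST) resolves to our management node and $(FULL_HOSTNAME) to the test node, and the real HTCondor matching also handles hostnames and wildcards.

```python
MANAGEMENT_NODE = "172.16.4.103"  # assumed to be what $(CONDOR_HOST) resolves to
TEST_NODE = "172.16.6.2"          # assumed to be what $(FULL_HOSTNAME) resolves to


def is_admin_allowed(peer_ip, allow_list):
    """Toy check: the peer is allowed only if it appears in the allow list."""
    return peer_ip in allow_list


# HOSTALLOW_ADMINISTRATOR = $(CONDOR_HOST)
default_policy = {MANAGEMENT_NODE}
# HOSTALLOW_ADMINISTRATOR = $(CONDOR_HOST), $(FULL_HOSTNAME)
workaround_policy = {MANAGEMENT_NODE, TEST_NODE}

if __name__ == "__main__":
    # The master accepts the command from the management node...
    print(is_admin_allowed(MANAGEMENT_NODE, default_policy))
    # ...but the startd sees the forwarded command coming from the local node.
    print(is_admin_allowed(TEST_NODE, default_policy))
    print(is_admin_allowed(TEST_NODE, workaround_policy))
```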

When we do a condor_off -peaceful -daemon startd, however, everything works as expected, since the shutdown command comes directly from the management machine.

Regards,

Joan


-- 
--------------------------------------------------------------------------
Joan Josep Piles Contreras -  Analista de sistemas
I3A - Instituto de Investigación en Ingeniería de Aragón
Tel: 876 55 51 47 (ext. 845147)
http://i3a.unizar.es -- jpiles@xxxxxxxxx
--------------------------------------------------------------------------
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/


