
Re: [Condor-users] Startd on workers dies just after claiming job "error opening watchdog pipe"




3/31 16:00:30 DaemonCore: PERMISSION DENIED to unknown user from host <10.0.10.41:34627> for command 455 (DAEMONS_ON), access level ADMINISTRATOR


The command that is being rejected requires administrator access. I assume you do not (nor should you) have the following configuration:

HOSTALLOW_ADMINISTRATOR = *

Instead, this is typically configured to allow administrative access only from a trusted host to which ordinary users do not have access. Using some form of strong user authentication (such as GSI) is another option. However, if you just want to authenticate trusted administrative users on each local machine, you can do that with FS authentication. Example:

# Authenticate administrative access so we can see if it
# is an administrative account local to this machine.  If you
# don't allow remote administrative commands (such as condor_reconfig -all)
# or all remote administrative commands are required to be
# authenticated via some remote authentication method such as GSI,
# then you could instead set this to REQUIRED.
SEC_ADMINISTRATOR_AUTHENTICATION = PREFERRED

ALLOW_ADMINISTRATOR = \
 root@$(UID_DOMAIN)/$(FULL_HOSTNAME) \
 condor@$(UID_DOMAIN)/$(FULL_HOSTNAME)
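
The example above leaves the choice of authentication method to the defaults. A minimal sketch of naming the method explicitly, assuming FS is the only method wanted for administrator commands (on Unix the default method list should already include FS, so this line is optional and is an assumption rather than part of the original example):

# Assumption: restrict administrator-level authentication to FS,
# which proves identity through the local file system and so only
# works for clients running on the same machine as the daemon.
SEC_ADMINISTRATOR_AUTHENTICATION_METHODS = FS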


Is this enabled by default in Condor v7?


No, it is not, at least not in the official distribution of Condor from the Condor web site. The shared secret mechanism you heard about is the pool password authentication method. It is only used to authenticate Condor daemons to each other. It cannot be used to authenticate users (or admins) to Condor. More info here:

http://www.cs.wisc.edu/condor/manual/v7.0/3_6Security.html#SECTION00463400000000000000

--Dan

Ian Stokes-Rees wrote:

I think there are multiple compounding problems here, and the disappearance of lock files is just one part (this may be due to tmpwatch). Looking in the MasterLog file on the worker node, I have the following lines:

3/31 15:23:51 Started DaemonCore process "/se/app/shared/condor/sbin/condor_startd", pid and pgroup = 26507
3/31 15:26:34 The STARTD (pid 26507) exited with status 4
3/31 15:26:34 restarting /se/app/shared/condor/sbin/condor_startd in 3600 seconds
3/31 16:00:24 DaemonCore: PERMISSION DENIED to unknown user from host <10.0.10.41:41217> for command 454 (DAEMONS_OFF), access level ADMINISTRATOR
3/31 16:00:30 DaemonCore: PERMISSION DENIED to unknown user from host <10.0.10.41:34627> for command 455 (DAEMONS_ON), access level ADMINISTRATOR

I have pretty much turned off any access control via the HOSTALLOW_* = * method in condor_config; however, I heard some people say there is a new shared secret mechanism. Is this enabled by default in Condor v7? For reference, 10.0.10.41 is the address of the worker node these log entries are taken from (i.e. it is denying a local user).

It seems the 3600 second delay for restarting STARTD is due to a back-off algorithm -- in the log I can see earlier restarts had a shorter delay.
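
The restart delay is controlled by the condor_master back-off settings. A sketch of the relevant knobs follows; the values are the defaults as recalled from the Condor manual, so treat them as assumptions and check them against the documentation for your version. The 3600 seconds in the log above is consistent with the delay having reached the ceiling.

# Roughly: delay = MASTER_BACKOFF_CONSTANT + MASTER_BACKOFF_FACTOR^n,
# where n is the number of restarts so far, capped at the ceiling.
MASTER_BACKOFF_CONSTANT = 9
MASTER_BACKOFF_FACTOR = 2.0
MASTER_BACKOFF_CEILING = 3600
# Seconds a daemon must stay up before the back-off count is reset.
MASTER_RECOVER_FACTOR = 300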

Cheers,

Ian

Greg Quinn wrote:

Ian,

I've seen problems like this in the past when there is a process running
that periodically deletes things in /tmp. Specifically, the Condor ProcD
daemon uses the configured LOCK directory to place named pipes over
which to communicate. If any of these pipes are externally deleted from
under the ProcD, errors like the ones you are seeing can result.

Greg
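
One way to act on the suggestion above is to move the lock directory out of /tmp so a periodic cleaner cannot remove the ProcD pipes. A minimal sketch, assuming a local (non-NFS) directory that is writable by the Condor user; the path /var/lock/condor is hypothetical and is not taken from this thread:

# Hypothetical path: a local directory that tmpwatch does not sweep.
LOCK = /var/lock/condor
# The ProcD named pipes are placed under $(LOCK) unless PROCD_ADDRESS
# is set elsewhere, so relocating LOCK relocates the pipes as well.
# PROCD_ADDRESS = $(LOCK)/procd_pipe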

On Mon, 2008-03-31 at 15:12 -0400, Ian Stokes-Rees wrote:
I am getting a repeated sequence of errors on my worker nodes where
STARTD aborts due to a "fatal exception".  I only have two worker
nodes, and they are both doing this.  An extract of StartLog is
below.  I am running 7.0.1.  STARTD on the cluster head nodes does
work and jobs run there without a problem.

Suggestions as to why this is happening (appears to be due to "error
opening watchdog pipe", but I can't be certain), and how to resolve it
would be greatly appreciated.

Cheers,

Ian

3/31 14:18:12 slot3: State change: claiming protocol successful
3/31 14:18:12 slot3: Changing state: Matched -> Claimed
3/31 14:18:14 slot3: Got activate_claim request from shadow (<10.0.10.39:55786>)
3/31 14:18:14 slot3: Remote job ID is 1593.0
3/31 14:18:15 error opening watchdog pipe /tmp/condor-lock.mackenzie0.0513363986547155/procd_pipe.STARTD.watchdog: No such file or directory (2)
3/31 14:18:15 ProcFamilyClient: error initializing LocalClient
3/31 14:18:15 ProcFamilyProxy: error initializing ProcFamilyClient
3/31 14:18:15 ERROR "ProcD has failed" at line 590 in file proc_family_proxy.C
3/31 14:18:15 slot3: Changing state and activity: Claimed/Idle -> Preempting/Killing
3/31 14:18:15 slot3: State change: No preempting claim, returning to owner
3/31 14:18:15 slot3: Changing state and activity: Preempting/Killing -> Owner/Idle
3/31 14:18:15 slot3: State change: IS_OWNER is false
3/31 14:18:15 slot3: Changing state: Owner -> Unclaimed
3/31 14:18:15 startd exiting because of fatal exception.
--
Ian Stokes-Rees                            W: http://sbgrid.org
ijstokes@xxxxxxxxxxxxxxxxxxx               T: +1 617 418-4168
SBGrid, Harvard Medical School             F: +1 617 432-5600

_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at: https://lists.cs.wisc.edu/archive/condor-users/
