
Re: [Condor-users] Startd on workers dies just after claiming job "error opening watchdog pipe"



I think there are several compounding problems here, and the disappearing lock files are only one of them (tmpwatch may be responsible for that part).  The MasterLog file on the worker node contains the following lines:

3/31 15:23:51 Started DaemonCore process "/se/app/shared/condor/sbin/condor_startd", pid and pgroup = 26507
3/31 15:26:34 The STARTD (pid 26507) exited with status 4
3/31 15:26:34 restarting /se/app/shared/condor/sbin/condor_startd in 3600 seconds
3/31 16:00:24 DaemonCore: PERMISSION DENIED to unknown user from host <10.0.10.41:41217> for command 454 (DAEMONS_OFF), access level ADMINISTRATOR
3/31 16:00:30 DaemonCore: PERMISSION DENIED to unknown user from host <10.0.10.41:34627> for command 455 (DAEMONS_ON), access level ADMINISTRATOR

I have effectively turned off access control by setting HOSTALLOW_* = * in condor_config.  However, I have heard that there is a new shared-secret mechanism.  Is it enabled by default in Condor v7?  For reference, 10.0.10.41 is the address of the worker node these log entries were taken from (i.e. it is denying a local user).

The 3600-second delay before restarting STARTD appears to come from a back-off algorithm: earlier restarts in the log had shorter delays.
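
To confirm the back-off pattern, the restart delays can be pulled out of MasterLog and listed in order. What follows is only a minimal sketch: the regular expression is modelled on the single "restarting ... in N seconds" line quoted above, and the function name restart_delays is made up for illustration, so it may need adjusting for other log formats.

```python
import re

# Modelled on the MasterLog line quoted in this message, e.g.
#   3/31 15:26:34 restarting /se/app/shared/condor/sbin/condor_startd in 3600 seconds
RESTART_RE = re.compile(
    r"^(?P<stamp>\d+/\d+ \d+:\d+:\d+) restarting (?P<daemon>\S+) "
    r"in (?P<delay>\d+) seconds"
)


def restart_delays(lines):
    """Return (timestamp, daemon path, delay in seconds) for each restart line.

    Lines that do not match the pattern are skipped.
    """
    found = []
    for line in lines:
        match = RESTART_RE.match(line.strip())
        if match:
            found.append(
                (match.group("stamp"), match.group("daemon"), int(match.group("delay")))
            )
    return found
```

Passing an open MasterLog file object to restart_delays() gives one tuple per restart, so delays that grow from one entry to the next would be consistent with a back-off.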

Cheers,

Ian

Greg Quinn wrote:
Ian,

I've seen problems like this in the past when there is a process running
that periodically deletes things in /tmp. Specifically, the Condor ProcD
daemon uses the configured LOCK directory to place named pipes over
which to communicate. If any of these pipes are externally deleted from
under the ProcD, errors like the ones you are seeing can result.
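
One way to check whether the pipes are still present is to list the named pipes in the lock directory while the daemons are running. This is only a rough sketch using the Python standard library; the helper name list_fifos is hypothetical, and the directory to inspect is whatever LOCK resolves to on the node.

```python
import os
import stat


def list_fifos(lock_dir):
    """Return the sorted names of named pipes (FIFOs) directly inside lock_dir.

    Returns None when lock_dir itself does not exist.
    """
    if not os.path.isdir(lock_dir):
        return None
    fifos = []
    for name in os.listdir(lock_dir):
        try:
            mode = os.lstat(os.path.join(lock_dir, name)).st_mode
        except FileNotFoundError:
            # The entry vanished between listing and inspecting it.
            continue
        if stat.S_ISFIFO(mode):
            fifos.append(name)
    return sorted(fifos)
```

For the StartLog in this thread, the directory would be /tmp/condor-lock.mackenzie0.0513363986547155 and the pipe of interest procd_pipe.STARTD.watchdog. A None result, or a list without that name while the startd is up, would point to something removing files from under the ProcD.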

Greg

On Mon, 2008-03-31 at 15:12 -0400, Ian Stokes-Rees wrote:
  
I am getting a repeated sequence of errors on my worker nodes where
STARTD aborts due to a "fatal exception".  I only have two worker
nodes, and both are doing this.  An extract of StartLog is
below.  I am running 7.0.1.  STARTD on the cluster head nodes does
work and jobs run there without a problem.

Suggestions as to why this is happening (appears to be due to "error
opening watchdog pipe", but I can't be certain), and how to resolve it
would be greatly appreciated.

Cheers,

Ian

3/31 14:18:12 slot3: State change: claiming protocol successful
3/31 14:18:12 slot3: Changing state: Matched -> Claimed
3/31 14:18:14 slot3: Got activate_claim request from shadow (<10.0.10.39:55786>)
3/31 14:18:14 slot3: Remote job ID is 1593.0
3/31 14:18:15 error opening watchdog pipe /tmp/condor-lock.mackenzie0.0513363986547155/procd_pipe.STARTD.watchdog: No such file or directory (2)
3/31 14:18:15 ProcFamilyClient: error initializing LocalClient
3/31 14:18:15 ProcFamilyProxy: error initializing ProcFamilyClient
3/31 14:18:15 ERROR "ProcD has failed" at line 590 in file proc_family_proxy.C
3/31 14:18:15 slot3: Changing state and activity: Claimed/Idle -> Preempting/Killing
3/31 14:18:15 slot3: State change: No preempting claim, returning to owner
3/31 14:18:15 slot3: Changing state and activity: Preempting/Killing -> Owner/Idle
3/31 14:18:15 slot3: State change: IS_OWNER is false
3/31 14:18:15 slot3: Changing state: Owner -> Unclaimed
3/31 14:18:15 startd exiting because of fatal exception.
-- 
Ian Stokes-Rees                            W: http://sbgrid.org
ijstokes@xxxxxxxxxxxxxxxxxxx               T: +1 617 418-4168
SBGrid, Harvard Medical School             F: +1 617 432-5600

_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at: 
https://lists.cs.wisc.edu/archive/condor-users/
    

