
Re: [Condor-users] Startd on workers dies just after claiming job "error opening watchdog pipe"



Ian,

I've seen problems like this in the past when there is a process running
that periodically deletes things in /tmp. Specifically, the Condor ProcD
daemon uses the configured LOCK directory to place named pipes over
which to communicate. If any of these pipes are externally deleted from
under the ProcD, errors like the ones you are seeing can result.
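
One way to check for this on a worker node is to confirm that the pipe named in the log still exists in the lock directory as a named pipe. The following is only a rough sketch, not part of Condor: the function name find_missing_pipes is made up for illustration, and the directory and pipe name in the usage comment are copied from the StartLog extract quoted below, so they will differ on other machines.

```python
import os
import stat


def find_missing_pipes(lock_dir, pipe_names):
    """Return the entries of pipe_names that are not named pipes in lock_dir.

    A name is reported as missing if the path does not exist (for example
    because a /tmp cleaner removed it) or exists but is not a FIFO.
    """
    missing = []
    for name in pipe_names:
        path = os.path.join(lock_dir, name)
        try:
            mode = os.stat(path).st_mode
        except OSError:
            # Path (or the lock directory itself) is gone or unreadable.
            missing.append(name)
            continue
        if not stat.S_ISFIFO(mode):
            missing.append(name)
    return missing


# Example usage, with the path taken from the quoted log (an assumption
# about this particular node, not a fixed Condor location):
#
#   find_missing_pipes(
#       "/tmp/condor-lock.mackenzie0.0513363986547155",
#       ["procd_pipe.STARTD.watchdog"],
#   )
```

If the pipe shows up as missing while the daemons are running, that would be consistent with something cleaning /tmp. In that case it may help to exclude the lock directory from the cleanup job, or to point LOCK at a directory outside /tmp.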

Greg

On Mon, 2008-03-31 at 15:12 -0400, Ian Stokes-Rees wrote:
> I am getting a repeated sequence of errors on my worker nodes where
> STARTD aborts due to a "fatal exception".  I only have two worker
> nodes, and both are doing this.  An extract of StartLog is
> below.  I am running 7.0.1.  STARTD on the cluster head nodes does
> work and jobs run there without a problem.
> 
> Suggestions as to why this is happening (appears to be due to "error
> opening watchdog pipe", but I can't be certain), and how to resolve it
> would be greatly appreciated.
> 
> Cheers,
> 
> Ian
> 
> 3/31 14:18:12 slot3: State change: claiming protocol successful
> 3/31 14:18:12 slot3: Changing state: Matched -> Claimed
> 3/31 14:18:14 slot3: Got activate_claim request from shadow (<10.0.10.39:55786>)
> 3/31 14:18:14 slot3: Remote job ID is 1593.0
> 3/31 14:18:15 error opening watchdog pipe /tmp/condor-lock.mackenzie0.0513363986547155/procd_pipe.STARTD.watchdog: No such file or directory (2)
> 3/31 14:18:15 ProcFamilyClient: error initializing LocalClient
> 3/31 14:18:15 ProcFamilyProxy: error initializing ProcFamilyClient
> 3/31 14:18:15 ERROR "ProcD has failed" at line 590 in file proc_family_proxy.C
> 3/31 14:18:15 slot3: Changing state and activity: Claimed/Idle -> Preempting/Killing
> 3/31 14:18:15 slot3: State change: No preempting claim, returning to owner
> 3/31 14:18:15 slot3: Changing state and activity: Preempting/Killing -> Owner/Idle
> 3/31 14:18:15 slot3: State change: IS_OWNER is false
> 3/31 14:18:15 slot3: Changing state: Owner -> Unclaimed
> 3/31 14:18:15 startd exiting because of fatal exception.
> -- 
> Ian Stokes-Rees                            W: http://sbgrid.org
> ijstokes@xxxxxxxxxxxxxxxxxxx               T: +1 617 418-4168
> SBGrid, Harvard Medical School             F: +1 617 432-5600