
[Condor-users] Startd on workers dies just after claiming job "error opening watchdog pipe"



I am getting a repeated sequence of errors on my worker nodes where STARTD aborts due to a "fatal exception".  I only have two worker nodes, and both show the same behaviour.  An extract of StartLog is below.  I am running Condor 7.0.1.  STARTD on the cluster head nodes does work, and jobs run there without a problem.

The abort appears to be caused by "error opening watchdog pipe", but I can't be certain.  Any suggestions as to why this is happening and how to resolve it would be greatly appreciated.

Cheers,

Ian

3/31 14:18:12 slot3: State change: claiming protocol successful
3/31 14:18:12 slot3: Changing state: Matched -> Claimed
3/31 14:18:14 slot3: Got activate_claim request from shadow (<10.0.10.39:55786>)
3/31 14:18:14 slot3: Remote job ID is 1593.0
3/31 14:18:15 error opening watchdog pipe /tmp/condor-lock.mackenzie0.0513363986547155/procd_pipe.STARTD.watchdog: No such file or directory (2)
3/31 14:18:15 ProcFamilyClient: error initializing LocalClient
3/31 14:18:15 ProcFamilyProxy: error initializing ProcFamilyClient
3/31 14:18:15 ERROR "ProcD has failed" at line 590 in file proc_family_proxy.C
3/31 14:18:15 slot3: Changing state and activity: Claimed/Idle -> Preempting/Killing
3/31 14:18:15 slot3: State change: No preempting claim, returning to owner
3/31 14:18:15 slot3: Changing state and activity: Preempting/Killing -> Owner/Idle
3/31 14:18:15 slot3: State change: IS_OWNER is false
3/31 14:18:15 slot3: Changing state: Owner -> Unclaimed
3/31 14:18:15 startd exiting because of fatal exception.
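
For anyone wanting to check the same thing on their own nodes, here is a minimal Python sketch that pulls the "error opening watchdog pipe" lines out of a StartLog and reports whether the named pipe and its parent directory exist at the time of the check. The regular expression is based only on the line format in the extract above, and the function names (find_pipe_errors, describe, report) are made up for illustration; none of this is part of Condor, and it does not identify the cause of the failure.

```python
import os
import re

# Assumed line format, taken from the StartLog extract above:
#   "... error opening watchdog pipe <path>: <reason> (<errno>)"
PIPE_ERROR = re.compile(
    r"error opening watchdog pipe (?P<path>\S+): (?P<reason>.+) \((?P<errno>\d+)\)"
)


def find_pipe_errors(log_lines):
    """Return a list of (path, reason, errno) for each watchdog pipe error line."""
    hits = []
    for line in log_lines:
        match = PIPE_ERROR.search(line)
        if match:
            hits.append(
                (match.group("path"), match.group("reason"), int(match.group("errno")))
            )
    return hits


def describe(path):
    """Report whether the pipe and its parent directory exist right now."""
    parent = os.path.dirname(path)
    return {
        "pipe_exists": os.path.exists(path),
        "parent_exists": os.path.isdir(parent),
    }


def report(log_path):
    """Print one line per watchdog pipe error found in the given log file."""
    with open(log_path) as handle:
        for path, reason, errno_value in find_pipe_errors(handle):
            state = describe(path)
            print(f"{path}: {reason} ({errno_value}) {state}")
```

Calling report() with the path to StartLog on a worker node prints each pipe path from the log along with whether it and its lock directory are present when the script runs, which may differ from their state when the error was logged.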

-- 
Ian Stokes-Rees                            W: http://sbgrid.org
ijstokes@xxxxxxxxxxxxxxxxxxx               T: +1 617 418-4168
SBGrid, Harvard Medical School             F: +1 617 432-5600