Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

[Condor-users] Startd on workers dies just after claiming job "error opening watchdog pipe"

Date: Mon, 31 Mar 2008 15:12:00 -0400
From: Ian Stokes-Rees <ijstokes@xxxxxxxxxxxxxxxxxxx>
Subject: [Condor-users] Startd on workers dies just after claiming job "error opening watchdog pipe"

I am getting a repeated sequence of errors on my worker nodes where STARTD aborts due to a "fatal exception". I have only two worker nodes, and both show this behaviour. An extract of StartLog is below. I am running Condor 7.0.1. STARTD on the cluster head nodes does work, and jobs run there without a problem.

Suggestions as to why this is happening (it appears to be due to "error opening watchdog pipe", but I can't be certain) and how to resolve it would be greatly appreciated.

Cheers,

Ian

3/31 14:18:12 slot3: State change: claiming protocol successful
3/31 14:18:12 slot3: Changing state: Matched -> Claimed
3/31 14:18:14 slot3: Got activate_claim request from shadow (<10.0.10.39:55786>)
3/31 14:18:14 slot3: Remote job ID is 1593.0
3/31 14:18:15 error opening watchdog pipe /tmp/condor-lock.mackenzie0.0513363986547155/procd_pipe.STARTD.watchdog: No such file or directory (2)
3/31 14:18:15 ProcFamilyClient: error initializing LocalClient
3/31 14:18:15 ProcFamilyProxy: error initializing ProcFamilyClient
3/31 14:18:15 ERROR "ProcD has failed" at line 590 in file proc_family_proxy.C
3/31 14:18:15 slot3: Changing state and activity: Claimed/Idle -> Preempting/Killing
3/31 14:18:15 slot3: State change: No preempting claim, returning to owner
3/31 14:18:15 slot3: Changing state and activity: Preempting/Killing -> Owner/Idle
3/31 14:18:15 slot3: State change: IS_OWNER is false
3/31 14:18:15 slot3: Changing state: Owner -> Unclaimed
3/31 14:18:15 startd exiting because of fatal exception.
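
To double-check the error on a worker, the following is a minimal Python sketch that pulls the watchdog pipe path out of StartLog lines and reports whether the pipe and its parent lock directory currently exist. It assumes the log line format shown in the extract above; the function names (find_watchdog_errors, describe) are hypothetical helpers written for illustration and are not part of Condor.

```python
import os
import re

# Matches "error opening watchdog pipe <path>: <reason> (<errno>)",
# assuming the line format seen in the StartLog extract.
WATCHDOG_RE = re.compile(
    r"error opening watchdog pipe (?P<path>\S+): (?P<reason>.+) \((?P<errno>\d+)\)"
)


def find_watchdog_errors(log_lines):
    """Return (path, reason, errno) for each watchdog pipe error found."""
    hits = []
    for line in log_lines:
        m = WATCHDOG_RE.search(line)
        if m:
            hits.append((m.group("path"), m.group("reason"), int(m.group("errno"))))
    return hits


def describe(path):
    """Report whether the pipe and its parent lock directory exist right now."""
    lock_dir = os.path.dirname(path)
    return {
        "lock_dir": lock_dir,
        "lock_dir_exists": os.path.isdir(lock_dir),
        "pipe_exists": os.path.exists(path),
    }


# Sample lines copied from the extract; on a real worker, read StartLog instead.
sample = [
    "3/31 14:18:14 slot3: Remote job ID is 1593.0",
    "3/31 14:18:15 error opening watchdog pipe "
    "/tmp/condor-lock.mackenzie0.0513363986547155/procd_pipe.STARTD.watchdog: "
    "No such file or directory (2)",
    "3/31 14:18:15 ProcFamilyClient: error initializing LocalClient",
]

for path, reason, err in find_watchdog_errors(sample):
    print(path, reason, err, describe(path))
```

If the lock directory itself turns out to be missing, that would at least narrow the question to what removed it under /tmp, rather than to the pipe alone.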

-- 
Ian Stokes-Rees                            W: http://sbgrid.org
ijstokes@xxxxxxxxxxxxxxxxxxx               T: +1 617 418-4168
SBGrid, Harvard Medical School             F: +1 617 432-5600

Follow-Ups:
- Re: [Condor-users] Startd on workers dies just after claiming job "error opening watchdog pipe"
  - From: Greg Quinn
