OK, maybe this is the issue... on the same slot I get messages like this around the time the job was vacated:
12/09/10 13:06:04 attempt to connect to <127.0.0.1:39905> failed: Connection refused (connect errno = 111).
12/09/10 13:06:09 slot4: State change: claim lease expired (condor_schedd gone?)
12/09/10 13:06:09 slot4: Changing state and activity: Claimed/Busy -> Preempting/Killing
12/09/10 13:06:09 slot4: Got KILL_FRGN_JOB while in Preempting state, ignoring.
12/09/10 13:06:09 Starter pid 11281 exited with status 0
12/09/10 13:06:09 slot4: State change: starter exited
12/09/10 13:06:09 slot4: State change: No preempting claim, returning to owner
The schedd isn't even running on that machine. It has only MASTER and STARTD (as it should). The job was started from elsewhere (I can verify the machine it started from).
12/09/10 12:46:09 slot4: match_info called
12/09/10 12:46:09 slot4: Remote job ID is 481.0
12/09/10 12:46:10 slot4: Got universe "VANILLA" (5) from request classad
12/09/10 12:46:10 slot4: State change: claim-activation protocol successful
12/09/10 12:46:10 slot4: Changing activity: Idle -> Busy
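To check whether the gap between claim activation and the lease expiry is really the 20 minutes I keep seeing, the two StartLog timestamps can be subtracted directly. This is only a rough sketch: the helper names (parse_timestamp, find_event, claim_lifetime) are made up for illustration, the timestamp format is assumed to be MM/DD/YY HH:MM:SS as in the excerpts above, and the two log lines are pasted in by hand rather than read from the real StartLog.

```python
from datetime import datetime

# Two lines copied from the StartLog excerpts in this thread.
LOG_LINES = [
    "12/09/10 12:46:10 slot4: State change: claim-activation protocol successful",
    "12/09/10 13:06:09 slot4: State change: claim lease expired (condor_schedd gone?)",
]


def parse_timestamp(line):
    """Parse the leading 'MM/DD/YY HH:MM:SS' stamp (assumed format, 17 chars)."""
    return datetime.strptime(line[:17], "%m/%d/%y %H:%M:%S")


def find_event(lines, slot, text):
    """Return the timestamp of the first line for `slot` containing `text`."""
    for line in lines:
        if f" {slot}: " in line and text in line:
            return parse_timestamp(line)
    return None


def claim_lifetime(lines, slot="slot4"):
    """Time between claim activation and lease expiry, or None if either is missing."""
    start = find_event(lines, slot, "claim-activation protocol successful")
    end = find_event(lines, slot, "claim lease expired")
    if start is None or end is None:
        return None
    return end - start


elapsed = claim_lifetime(LOG_LINES)
print(elapsed)  # 0:19:59
```

For the lines above this comes out one second short of 20 minutes, which matches the restart interval and points at the claim lease timing out rather than the job itself failing.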
On Thu, Dec 9, 2010 at 2:59 PM, Erik Aronesty <erik@xxxxxxx> wrote:
OK I tried everything you said... my jobs are still restarting every 20 minutes for no reason I can think of.
On Thu, Dec 9, 2010 at 3:16 PM, Matthew Farrellee <matt@xxxxxxxxxx> wrote:
You should have a look at the StartLog and see what happens around the state changes you posted earlier.