
Re: [Condor-users] jobs vacating reason



I also verified that port 39905 is the right port and that 192.168.16.123 is listening on it.  So now I have to convince the remote machines to NOT use 127.0.0.1, and instead to communicate with the schedd via the IP address that the requests came in on.
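
One way to look into the 127.0.0.1 address is to check, on the submit machine, whether the machine's own hostname resolves to a loopback address. This is only a guess at the cause: nothing in this thread confirms that name resolution is what makes the schedd advertise 127.0.0.1. The sketch below uses only the Python standard library; the helper names is_loopback and resolved_addresses are made up for illustration.

```python
import ipaddress
import socket
from typing import List


def is_loopback(addr: str) -> bool:
    """Return True if addr is a loopback IP such as 127.0.0.1."""
    return ipaddress.ip_address(addr).is_loopback


def resolved_addresses(hostname: str) -> List[str]:
    """Return the IPv4 addresses the local resolver gives for hostname."""
    try:
        _, _, addrs = socket.gethostbyname_ex(hostname)
    except OSError:
        # Resolution failed; report no addresses rather than raising.
        return []
    return addrs


if __name__ == "__main__":
    # Assumption: this is run on the machine where the schedd lives.
    name = socket.gethostname()
    for addr in resolved_addresses(name):
        flag = "loopback" if is_loopback(addr) else "ok"
        print(f"{name} -> {addr} ({flag})")
```

If the hostname does come back as a loopback address, that would be worth ruling out before looking elsewhere.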

On Thu, Dec 9, 2010 at 4:23 PM, Erik Aronesty <erik@xxxxxxx> wrote:
OK, the jobs are lasting "exactly" 20 minutes, which is, coincidentally, the value of "JobLeaseDuration".  If I raise "JobLeaseDuration", they last longer.  This means (to me) that ALIVE messages aren't getting through.  Right?  So that log is spot on... but why on earth would it be connecting to "localhost" to send an ALIVE?


On Thu, Dec 9, 2010 at 3:27 PM, Erik Aronesty <erik@xxxxxxx> wrote:
OK, maybe this is the issue... on the same slot I get messages like this around the time the job was vacated:


12/09/10 13:06:04 attempt to connect to <127.0.0.1:39905> failed: Connection refused (connect errno = 111).
12/09/10 13:06:04 slot4: Failed to connect to schedd <127.0.0.1:39905>
12/09/10 13:06:09 slot4: State change: claim lease expired (condor_schedd gone?)
12/09/10 13:06:09 slot4: Changing state and activity: Claimed/Busy -> Preempting/Killing
12/09/10 13:06:09 slot4: Got KILL_FRGN_JOB while in Preempting state, ignoring.
12/09/10 13:06:09 Starter pid 11281 exited with status 0
12/09/10 13:06:09 slot4: State change: starter exited
12/09/10 13:06:09 slot4: State change: No preempting claim, returning to owner

schedd isn't even running on that machine... it's got MASTER and STARTD only (as it should).  The job was started from elsewhere (I can verify the machine it started from).

12/09/10 12:46:09 slot4: match_info called
12/09/10 12:46:09 slot4: Got activate_claim request from shadow (<192.168.16.123:42331>)
12/09/10 12:46:09 slot4: Remote job ID is 481.0
12/09/10 12:46:10 slot4: Got universe "VANILLA" (5) from request classad
12/09/10 12:46:10 slot4: State change: claim-activation protocol successful
12/09/10 12:46:10 slot4: Changing activity: Idle -> Busy
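
The two excerpts above can be checked against each other by subtracting the timestamps. The following is a minimal sketch, assuming every StartLog line begins with a 17-character "MM/DD/YY HH:MM:SS" stamp as in the lines quoted here; parse_stamp and seconds_between are illustrative names, not part of Condor.

```python
from datetime import datetime

# Timestamp layout as it appears in the quoted StartLog lines.
STAMP_FORMAT = "%m/%d/%y %H:%M:%S"
STAMP_LENGTH = 17  # len("12/09/10 13:06:09")


def parse_stamp(line: str) -> datetime:
    """Parse the leading timestamp of a StartLog line."""
    return datetime.strptime(line[:STAMP_LENGTH], STAMP_FORMAT)


def seconds_between(start_line: str, end_line: str) -> float:
    """Return the elapsed seconds between two StartLog lines."""
    return (parse_stamp(end_line) - parse_stamp(start_line)).total_seconds()


# Lines copied from the excerpts above.
activated = ("12/09/10 12:46:09 slot4: Got activate_claim request "
             "from shadow (<192.168.16.123:42331>)")
expired = ("12/09/10 13:06:09 slot4: State change: "
           "claim lease expired (condor_schedd gone?)")

elapsed = seconds_between(activated, expired)
print(elapsed)  # 1200.0 seconds, i.e. 20 minutes
```

The 1200-second gap between the claim activation and the lease expiry matches the 20-minute job lifetime.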

On Thu, Dec 9, 2010 at 2:59 PM, Erik Aronesty <erik@xxxxxxx> wrote:

OK I tried everything you said... my jobs are still restarting every 20 minutes for no reason I can think of.

- Erik


On Thu, Dec 9, 2010 at 3:16 PM, Matthew Farrellee <matt@xxxxxxxxxx> wrote:
On 12/09/2010 02:59 PM, Erik Aronesty wrote:
OK I tried everything you said... my jobs are still restarting every 20
minutes for no reason I can think of.

- Erik

You should have a look at the StartLog and see what happens around the state changes you posted earlier.

Best,


matt