
Re: [Condor-users] condor lease duration not working??



On Mon, 16 May 2005 20:51:54 -0700  John Wheez wrote:

> Machines on my pool fail to notify the condor collector that the
> tasks have finished and remain in a busy state even though the job
> finished successfully and there is no CPU utilization.

you must mean "fail to notify the condor_schedd where i submitted
them", since the machines never notify the collector about anything
when jobs complete.

> Eventually every CPU in my pool becomes permanently busy.

weird.

> I even set the job_lease_duration = 400 in my submit file...but this
> does not get my cpus back in my pool...below is the error from one
> of the starter.log files.

hmm.  

> Any ideas???
>
> 5/16 10:12:57 Create_Process succeeded, pid=3368
> 5/16 10:13:23 Process exited, pid=3368, status=0
> 5/16 10:13:47 getpeername failed so connect must have failed
> 5/16 10:14:12 Connect failed for 30 seconds; returning FALSE
> 5/16 10:14:12 FileTransfer: Unable to connect to server <192.168.0.3:9635>

that's the real problem: the starter can't connect back to your submit
machine.  are you on a private network?  (that 192.168 ip address
makes me nervous.)  do you have a firewall in place that's blocking
outgoing connections from the execute machines back to your submit
machine?  do you have multiple NICs on the machines?  is your
NETWORK_INTERFACE set correctly?  one of those is probably the cause.
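
a quick way to check the execute -> submit direction is to try a plain
tcp connect from one of the execute machines.  the sketch below is not
part of condor; can_connect is a hypothetical helper name, and it only
tells you whether a tcp connection can be opened at all, not whether
condor's file transfer would succeed.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within timeout.

    Hypothetical helper for illustration; it only tests raw TCP
    reachability and knows nothing about Condor itself.
    """
    try:
        # create_connection resolves the host and applies the timeout
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # covers refused connections, timeouts, and unreachable hosts
        return False
```

run it on an execute machine as can_connect("192.168.0.3", 9635),
substituting the address from your own starter log, since the port
will most likely differ from run to run.  a False result there, with a
True result in the other direction, would fit the one-way connectivity
guess.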

> 5/16 10:14:12 JIC::allJobsDone() failed, waiting for job lease to expire
> or for a reconnect attempt

then what?  nothing else in the logs at all?  does the StartLog say
anything around this time?  is the starter process still on the
machine?  it's the startd that really enforces the job_lease_duration,
so if there was a real network failure, the StartLog should say
something like:

"State change: claim lease expired (condor_schedd gone?)"

if not, it probably means your schedd is still in touch with the
startd, sending keep-alives.  if you wanted to be sure, enable
"D_PROTOCOL" in your STARTD_DEBUG config file setting, reconfig your
startd, and you should start to see messages like this every 5 minutes
or so:

"Keep alive for ClaimId xxxxxxxxx"

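to tell the two cases apart without reading the whole StartLog by eye,
you could count how often each message shows up.  this is only a
sketch: scan_startlog is a made-up helper, and the two substrings are
taken from the messages quoted above, so adjust them if your version
words things differently.

```python
def scan_startlog(lines):
    """Count lease-expiry and keep-alive messages in StartLog lines.

    Hypothetical helper; the substrings below come from the messages
    quoted in this thread and are assumed to appear verbatim.
    """
    counts = {"lease_expired": 0, "keep_alive": 0}
    for line in lines:
        if "claim lease expired" in line:
            counts["lease_expired"] += 1
        elif "Keep alive for ClaimId" in line:
            counts["keep_alive"] += 1
    return counts
```

pass it an open file, e.g. scan_startlog(open(path_to_startlog)).
lots of keep-alives and no lease expirations would suggest the schedd
is still renewing the claim.
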
i'm guessing your network allows direct connections from submit ->
execute, but not from execute -> submit.  since the connections from
submit -> execute are working, your schedd is still happily sending
keep-alives to renew the job lease, so the startd thinks the claim is
still valid.  the shadow, meanwhile, is waiting to hear from the
starter that the job finished, but the starter is stuck: it can't
connect back to the shadow to tell it, so it just sits there waiting
for someone to talk to it.

granted, condor's not handling this failure case particularly well,
but if my guesses are right, this is a pretty nasty case for condor to
handle intelligently...

-derek