Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] Jobs repeatedly evicted after 30 mins

Date: Thu, 2 Mar 2006 14:04:46 -0600
From: Jaime Frey <jfrey@xxxxxxxxxxx>
Subject: Re: [Condor-users] Jobs repeatedly evicted after 30 mins

On Mar 1, 2006, at 9:19 PM, <Greg.Hitchen@xxxxxxxx> wrote:

We have the situation where a user submits ~10 jobs,
all of which should run for ~5 hours. Many/most of
them get repeatedly evicted after 30 mins and requeued.
Below are the relevant logs from the submitting and execute
machines for one particular instance.

I have tested this myself with different jobs and the eviction
is ALWAYS ALMOST EXACTLY a few seconds (20?) under 30 minutes.

The line in the START LOG:

3/1 05:57:16 State change: claim timed out (condor_schedd gone?)

seems to be the relevant one?

ALL of the evictions (for different execute machines and different
jobs, same submit machine) occur at 30 minutes.

While a job is running, the schedd periodically sends an alive message to the startd via UDP. If the startd doesn't receive any alive messages for a while, it will kill the claim (and the job). The default is for the schedd to send an alive every 5 minutes and the startd will kill the job if it misses 6 alives, which matches your 30 minutes.
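
To make the arithmetic concrete, here is a minimal Python sketch of that timeout logic. It is only a model of the behavior described above, not Condor code: the function and parameter names are made up for illustration, and the 5-hour job length comes from the original report.

```python
# Illustrative model of the claim timeout described above.
# All names here are hypothetical; this is not Condor source or its API.

def claim_timeout_seconds(alive_interval_s=300, max_alives_missed=6):
    """Seconds of silence after which the startd gives up on the claim."""
    return alive_interval_s * max_alives_missed


def eviction_time(alive_arrivals, alive_interval_s=300, max_alives_missed=6,
                  job_length_s=5 * 3600):
    """Return the second at which the claim times out, or None if the job
    finishes first. alive_arrivals lists the times (seconds since the job
    started) at which an alive message actually reached the startd."""
    timeout = claim_timeout_seconds(alive_interval_s, max_alives_missed)
    last_heard = 0  # the claim starts out fresh when the job starts
    for t in sorted(alive_arrivals):
        if t > last_heard + timeout:
            break  # the claim had already timed out before this one arrived
        last_heard = t
    deadline = last_heard + timeout
    return deadline if deadline < job_length_s else None


print(claim_timeout_seconds())                   # 1800 seconds = 30 minutes
print(eviction_time([]))                         # 1800: no alives ever arrive
print(eviction_time(range(300, 5 * 3600, 300)))  # None: every alive arrives
```

With no alives getting through, the model evicts at 1800 seconds, which lines up with the evictions seen at just under 30 minutes.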

So it appears that UDP packets aren't making it from your submit machine to your execute machines.
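
One rough way to check this is to send a UDP datagram by hand between the two machines. The Python sketch below is a generic reachability probe under stated assumptions: it does not speak Condor's protocol, the helper names and the payload are invented for illustration, and the port you pick is up to you. A datagram arriving on one port does not prove the startd's own port is reachable, so treat a success as a hint rather than a diagnosis.

```python
# Generic UDP reachability probe; hypothetical helpers, not part of Condor.
import socket


def open_listener(bind_addr="0.0.0.0", port=0):
    """Bind a UDP socket (run this on the execute machine).
    Port 0 lets the OS pick a free port; read it with getsockname()."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((bind_addr, port))
    return sock


def wait_for_datagram(sock, timeout_s=10.0):
    """Return (payload, sender_address), or None if nothing arrives in time."""
    sock.settimeout(timeout_s)
    try:
        return sock.recvfrom(1024)
    except socket.timeout:
        return None


def send_datagram(host, port, payload=b"alive-test"):
    """Send one UDP datagram (run this on the submit machine)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


if __name__ == "__main__":
    # Loopback self-test; across two machines, run the listener on the
    # execute node and send_datagram() from the submit node instead.
    listener = open_listener("127.0.0.1", 0)
    port = listener.getsockname()[1]
    send_datagram("127.0.0.1", port)
    print(wait_for_datagram(listener, timeout_s=2.0))
    listener.close()
```

If the listener on the execute machine times out while the same test works in the other direction or over loopback, a firewall or router dropping UDP between the hosts is a likely suspect.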


+--------------------------------+-----------------------------------+
|           Jaime Frey           | I used to be a heavy gambler.     |
|       jfrey@xxxxxxxxxxx        | But now I just make mental bets.  |
| http://www.cs.wisc.edu/~jfrey/ | That's how I lost my mind.        |
+--------------------------------+-----------------------------------+
