
Re: [Condor-users] Jobs repeatedly evicted after 30 mins



On Mar 1, 2006, at 9:19 PM, <Greg.Hitchen@xxxxxxxx> wrote:

We have the situation where a user submits ~10 jobs,
all of which should run for ~5 hours. Many/most of
them get repeatedly evicted after 30 mins and requeued.
Below are the relevant logs from the submitting and execute
machines for one particular instance.

I have tested this myself with different jobs, and the eviction
ALWAYS occurs a few seconds (about 20?) under 30 minutes.

The line in the START LOG:

3/1 05:57:16 State change: claim timed out (condor_schedd gone?)

seems to be the relevant one?

ALL of the evictions (for different execute machines and different
jobs, same submit machine) occur at 30 minutes.

While a job is running, the schedd periodically sends an alive message to the startd via UDP. If the startd doesn't receive any alive messages for a while, it will kill the claim (and the job). By default, the schedd sends an alive every 5 minutes, and the startd kills the claim once it has missed 6 alives in a row. That is 6 x 5 minutes = 30 minutes, which matches the eviction time you're seeing.
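
As a quick check of that arithmetic, here is a minimal Python sketch. The constant and function names are illustrative only and are not taken from Condor; consult the manual for your Condor version for the actual configuration settings that control the alive interval and the number of missed alives.

```python
# Defaults as described in the reply above; the names are hypothetical.
ALIVE_INTERVAL_SECONDS = 5 * 60  # schedd sends one alive every 5 minutes
MAX_ALIVES_MISSED = 6            # startd gives up after missing 6 alives


def claim_timeout_seconds(alive_interval=ALIVE_INTERVAL_SECONDS,
                          max_missed=MAX_ALIVES_MISSED):
    """Return how long the startd waits without an alive before killing the claim."""
    return alive_interval * max_missed


print(claim_timeout_seconds() / 60)  # 30.0 (minutes)
```

If either value has been changed from its default on your pool, the eviction time should shift accordingly.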

So it appears that UDP packets aren't making it from your submit machine to your execute machines.
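
One rough way to test this outside of Condor is to send a UDP datagram by hand from the submit machine and see whether it arrives on an execute machine. The sketch below is only a generic probe under stated assumptions: the function names, the payload, and the port you pick are all hypothetical, and it does not speak Condor's protocol or use the startd's real port. A firewall rule may be port-specific, so a probe that gets through on one port does not prove that the alive messages do.

```python
import socket


def open_listener(port, bind_addr="0.0.0.0"):
    """Bind a UDP socket on the receiving (execute) machine. Port 0 picks a free port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((bind_addr, port))
    return sock


def wait_for_probe(sock, timeout=10.0):
    """Wait up to `timeout` seconds for one datagram; return (data, sender) or None."""
    sock.settimeout(timeout)
    try:
        return sock.recvfrom(1024)
    except socket.timeout:
        return None


def send_probe(host, port, payload=b"alive-probe"):
    """Send one UDP datagram from the sending (submit) machine."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
```

To use it, call `open_listener(port)` followed by `wait_for_probe(sock)` on the execute machine, then call `send_probe(execute_host, port)` on the submit machine. If `wait_for_probe` returns `None`, the datagram was lost or blocked somewhere along the path.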

+--------------------------------+-----------------------------------+
|           Jaime Frey           | I used to be a heavy gambler.     |
|       jfrey@xxxxxxxxxxx        | But now I just make mental bets.  |
| http://www.cs.wisc.edu/~jfrey/ | That's how I lost my mind.        |
+--------------------------------+-----------------------------------+