
Re: [Condor-users] Job evicted



On Sep 29, 2006, at 11:06 AM, Alan wrote:

I have some jobs that should take no more than 10 minutes but are
taking hours to complete. Looking at their logs, I see several entries
like the one below:

001 (10945.000.000) 09/29 16:44:14 Job executing on host: <172.24.89.68:46508>
...
004 (10945.000.000) 09/29 16:45:48 Job was evicted.
        (0) Job was not checkpointed.
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        2367357  -  Run Bytes Sent By Job
        10400133  -  Run Bytes Received By Job

[and so on]

My job keeps being evicted and rescheduled on another node, until
it finally terminates.

Is there a way to look deeper into this behaviour? Is there something
that I can do to avoid or at least minimise it?

It gets weirder because I also submit several other jobs at the same
time and almost all of them terminate in a couple of minutes, as
expected.

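The shadow log on your submit machine (ShadowLog by default) and the startd and starter logs on the execute machines (StartLog and StarterLog by default) are the places to start looking for the cause of these evictions. Look for entries around the eviction timestamps reported in the job's user log.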
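
To know which execute machines' logs to check, it can help to pair each eviction in the user log with the host the job was running on at the time. The following is only a rough sketch, assuming the user log has the same event layout as the excerpt quoted above (a three-digit event code, the job id in parentheses, date, time, then text). The function name eviction_hosts and the SAMPLE string are made up for illustration and are not part of Condor.

```python
import re
from collections import defaultdict

# Event header line, e.g.
# "001 (10945.000.000) 09/29 16:44:14 Job executing on host: <172.24.89.68:46508>"
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}) (.*)$"
)
HOST_RE = re.compile(r"<([^>]+)>")


def eviction_hosts(log_text):
    """Return {job_id: [(host, evicted_at), ...]} for every eviction event.

    Hypothetical helper: it assumes 001 marks "Job executing" and 004 marks
    "Job was evicted", as in the excerpt quoted in this thread.
    """
    current_host = {}               # job id -> host of the most recent 001 event
    evictions = defaultdict(list)   # job id -> list of (host, timestamp)
    for line in log_text.splitlines():
        m = EVENT_RE.match(line)
        if not m:
            continue  # detail lines and "..." separators are skipped
        code, cluster, proc, _subproc, date, clock, text = m.groups()
        job = f"{int(cluster)}.{int(proc)}"
        if code == "001":
            host = HOST_RE.search(text)
            current_host[job] = host.group(1) if host else None
        elif code == "004":
            evictions[job].append((current_host.get(job), f"{date} {clock}"))
    return dict(evictions)


# Sample input copied from the log excerpt in the original message.
SAMPLE = """\
001 (10945.000.000) 09/29 16:44:14 Job executing on host: <172.24.89.68:46508>
...
004 (10945.000.000) 09/29 16:45:48 Job was evicted.
        (0) Job was not checkpointed.
"""

if __name__ == "__main__":
    for job, events in eviction_hosts(SAMPLE).items():
        for host, when in events:
            print(f"job {job} evicted at {when} while on {host}")
```

Each host and timestamp pair it reports tells you which execute machine's logs to read and roughly where in them to look.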

+--------------------------------+-----------------------------------+
|           Jaime Frey           | I used to be a heavy gambler.     |
|       jfrey@xxxxxxxxxxx        | But now I just make mental bets.  |
| http://www.cs.wisc.edu/~jfrey/ | That's how I lost my mind.        |
+--------------------------------+-----------------------------------+