
[Condor-users] Job evicted



Hi List,

I have some jobs that should take no more than 10 minutes but are
taking hours to complete. Looking at their logs, I see several entries
like the one below:

001 (10945.000.000) 09/29 16:44:14 Job executing on host: <172.24.89.68:46508>
...
004 (10945.000.000) 09/29 16:45:48 Job was evicted.
       (0) Job was not checkpointed.
               Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
               Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
       2367357  -  Run Bytes Sent By Job
       10400133  -  Run Bytes Received By Job

[and so on]
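
To tally these events, something along the lines of the rough sketch below
might help. It assumes the user log follows the format shown above (a
three-digit event code, the job id in parentheses, then date and time), and
it only looks at codes 001 and 004 as they appear in the excerpt. The
function name find_evictions and the placeholder year are my own
assumptions, not anything provided by Condor.

```python
import re
from datetime import datetime

# Event header as it appears in the excerpt above:
# "004 (10945.000.000) 09/29 16:45:48 Job was evicted."
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+\.\d+\.\d+)\) (\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$"
)
HOST_RE = re.compile(r"<([^>]+)>")

# The log lines carry no year, so a placeholder is assumed here.
# Runs that cross a year boundary would give wrong durations.
ASSUMED_YEAR = "2000"


def find_evictions(log_text):
    """Return (job_id, host, seconds_run) for each execute/evict pair."""
    running = {}    # job id -> (host, start time) from the last 001 event
    evictions = []
    for line in log_text.splitlines():
        m = EVENT_RE.match(line.strip())
        if not m:
            continue  # detail lines, "..." separators, blank lines
        code, job, stamp, rest = m.groups()
        when = datetime.strptime(
            ASSUMED_YEAR + "/" + stamp, "%Y/%m/%d %H:%M:%S"
        )
        if code == "001":
            host = HOST_RE.search(rest)
            running[job] = (host.group(1) if host else "unknown", when)
        elif code == "004" and job in running:
            host, started = running.pop(job)
            seconds = int((when - started).total_seconds())
            evictions.append((job, host, seconds))
    return evictions
```

Grouping the result by host would show whether the evictions cluster on
particular machines or happen after a similar run time everywhere.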

The job keeps being evicted and restarted on another node until it
finally terminates.

Is there a way to investigate this behaviour further? Is there anything
I can do to avoid it, or at least minimise it?

It gets weirder because I submit several other jobs at the same time,
and almost all of them terminate in a couple of minutes, as expected.

Thanks in advance for any comments.

Cheers,
Alan

--
Alan Wilter S. da Silva, D.Sc. - Research Associate
Department of Biochemistry, University of Cambridge.
80 Tennis Court Road, Cambridge CB2 1GA, UK.
http://www.bio.cam.ac.uk/~awd28