[Condor-users] Distinguish preemption from crash



Hi,

We have some jobs that fail once in a while because of temporary memory, disk, or network issues. Restarting the jobs usually solves the problem, but sometimes there are issues that make a job crash every time, so restarting it indefinitely is just a waste of processors. Before we enabled preemption, we capped the number of restarts (with an on-exit hold expression that triggered after n restarts). Enabling preemption caused lots of problems: judging from the job run count ClassAd variable, there is no difference between a preemption and a software problem, so preemptions made jobs reach this restart limit quite fast.

What would you suggest to get around this problem? Can I somehow subtract the number of preemptions from the job run count? Or should I add a custom attribute that counts just the software crashes, based on the return values?
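
To make the second idea concrete, here is a minimal sketch of the bookkeeping I have in mind, written in plain Python rather than as a Condor expression. All names in it (RunOutcome, count_crashes, should_hold, MAX_CRASHES) are hypothetical and only illustrate the logic; they are not Condor attributes or APIs. It assumes that each run can be labelled as either preempted or exited with some return value.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Hypothetical limit on real crashes; not a Condor setting.
MAX_CRASHES = 3


@dataclass
class RunOutcome:
    """One execution attempt of a job (illustrative record only)."""
    preempted: bool                  # True if the run ended by preemption
    exit_code: Optional[int] = None  # return value when the job exited by itself


def count_crashes(runs: Iterable[RunOutcome]) -> int:
    """Count runs that were not preempted and did not exit with code 0."""
    return sum(1 for r in runs if not r.preempted and r.exit_code != 0)


def should_hold(runs: Iterable[RunOutcome], max_crashes: int = MAX_CRASHES) -> bool:
    """Hold the job once the crash count (ignoring preemptions) reaches the limit."""
    return count_crashes(runs) >= max_crashes


# Five runs in total, but only two of them are real crashes.
history = [
    RunOutcome(preempted=True),
    RunOutcome(preempted=False, exit_code=1),
    RunOutcome(preempted=True),
    RunOutcome(preempted=False, exit_code=137),
    RunOutcome(preempted=True),
]
```

With this history a plain run counter would already be at five, while the crash-only counter is at two, so the job would still be allowed to restart.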

Cheers,
Szabolcs