Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

[Condor-users] Distinguish preemption from crash

Date: Mon, 11 May 2009 22:37:05 +0200
From: Horvátth Szabolcs <szabolcs@xxxxxxxxxxxxx>
Subject: [Condor-users] Distinguish preemption from crash

Hi,

We have some jobs that tend to fail once in a while because of temporary memory / disk / network issues. Restarting the jobs usually solves the problem, but sometimes there are issues that make a job always crash, so restarting it an unlimited number of times is just a waste of processors. When we did not have preemption enabled, we used to limit the number of restarts (by using an on exit hold expression after n restarts), but enabling preemption caused lots of problems: judging by the job run count ClassAd variable, there is no difference between a preemption and a software problem, and preemptions made jobs reach this restart limit quite fast.

What would you suggest doing to get around this problem? Can I somehow subtract the number of preemptions from the job run count? Or should I add a custom attribute to count just the software crashes based on the return values?


Cheers,
Szabolcs
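
For readers skimming the archive, the distinction asked about above can be sketched in plain Python. This is only an illustration of the counting logic, not Condor syntax and not a confirmed solution: the `RunEnd` record, `count_crashes`, and `should_hold` are hypothetical names, and it assumes that each execution attempt can be labelled as either preempted or exited on its own with a return value.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RunEnd:
    """How one execution attempt ended (hypothetical record, not a Condor type)."""
    preempted: bool                  # True if the job was evicted from the machine
    exit_code: Optional[int] = None  # set only when the job exited on its own


def count_crashes(history: List[RunEnd]) -> int:
    """Count attempts that ended on their own with a non-zero return value."""
    return sum(
        1
        for run in history
        if not run.preempted and run.exit_code not in (None, 0)
    )


def should_hold(history: List[RunEnd], max_crashes: int) -> bool:
    """Hold the job once the crash count (not the run count) reaches the limit."""
    return count_crashes(history) >= max_crashes


# Four attempts in total, but only two of them are software crashes.
history = [
    RunEnd(preempted=True),
    RunEnd(preempted=False, exit_code=1),
    RunEnd(preempted=True),
    RunEnd(preempted=False, exit_code=139),
]
run_count = len(history)
crash_count = count_crashes(history)
```

With a limit of three, a check based on the raw run count would already hold this job, while the crash-only count would still allow another restart.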
