Re: [Condor-users] Good way to start failed jobs from large cluster?

Hi Matt,

Matthew Farrellee wrote:
> 1) you can probably do condor_history -format "%d\n" ProcId -constraint
> "ClusterId == ?? && [magic to identify proc as a failure]" to get your Ids

In this case it was even easier: just looking at condor_q was enough (see below).

> 2) are your failed jobs being removed from the queue, why not use an
> OnExit policy to put them on hold when [magic to identify proc as a
> failure] is identified. This would let you avoid the resubmission, you'd
> just have to release the jobs for them to run again.
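
For reference, such a policy might look roughly like the fragment below in the submit description file. This is only a sketch: the expression assumes a failed proc shows up as a death by signal or a non-zero exit code, which stands in for the "[magic to identify proc as a failure]" part and would need adapting to the real failure criterion.

```
# Submit description file fragment (sketch, not taken from the original job).
# Assumption: "failure" means the job died on a signal or exited non-zero.
on_exit_hold = (ExitBySignal == True) || (ExitCode != 0)
```

Jobs matching the expression would then stay in the queue in the held state until someone runs condor_release on them.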

In this case the user's job was running compiled Matlab code, and it
seems that due to a race condition (which another user won) quite a few
jobs were still running and doing stupid things (stat a directory, try to
open it, fail, sleep for 0.1 s, stat the directory again, ... for 2 days).

Thus getting the IDs was easy enough with condor_q. condor_hold and
condor_release helped this time, but after that a few jobs showed some
weird patterns in the results, and these we then wanted to run again
(which may or may not have been linked to the incident). Hence the
question of whether this simple for loop was already close to optimal.
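
To make the hold/release step concrete, here is a minimal shell sketch. It is not the loop from the original question: the function names and the DRY_RUN switch are made up for illustration, and the constraint assumes the jobs of interest are the ones still running (JobStatus == 2) in a given cluster.

```shell
#!/bin/sh
# Hypothetical helpers; names and the DRY_RUN switch are illustrative only.

# Print "cluster.proc" for every job of cluster $1 that is still running
# (JobStatus == 2 means "running").
list_running() {
    condor_q -constraint "ClusterId == $1 && JobStatus == 2" \
             -format "%d." ClusterId -format "%d\n" ProcId
}

# Hold and then release each given job ID so that it starts over.
# With DRY_RUN=1 (the default here) the commands are only printed.
restart_jobs() {
    for id in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "condor_hold $id"
            echo "condor_release $id"
        else
            condor_hold "$id" && condor_release "$id"
        fi
    done
}

# Example (cluster number is a placeholder):
#   DRY_RUN=0 restart_jobs $(list_running 1234)
```

Both condor_hold and condor_release also accept several job IDs at once, so the loop could be replaced by two calls if per-job handling is not needed.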

OnExit would not really help since human intervention/analysis was
needed on the results to find this issue.