
Re: [Condor-users] Automatic Restart of Failed jobs.



On 10/05/2010 10:32 PM, Edier Alberto Zapata Hernández wrote:
Good evening,
  Today I was running some tests with Exonerate using Condor. I split the
queries file into many files, each one with only 1 sequence in it. The
problem is that some of the jobs failed: some because the node was down
when I put the database on them, others because they crashed, and so on.

I got the error files of all the jobs, but checking them one by one, finding
each job's files and restarting it is a little slow (the queries file has
13,600+ sequences). Is there a parameter in the submit file to specify
that if the job fails (and only if it fails; if the job finishes
OK, no action should be taken) Condor should try to restart it?

Thank you.

----
Edier Alberto Zapata Hernández
Systems Engineering student
Universidad de Valle

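Hopefully you can identify failed runs by the process's exit code, in which case you should consider on_exit_remove. It is an expression evaluated when the job exits; if it evaluates to False, the job stays in the queue and is run again instead of being removed,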

http://www.cs.wisc.edu/condor/manual/v7.5/condor_submit.html#73972

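Or maybe on_exit_hold, if you want a chance to fix up files or the database server before the job is retried. A job whose on_exit_hold expression evaluates to True is put on hold when it exits and stays there until you release it with condor_release,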

http://www.cs.wisc.edu/condor/manual/v7.5/condor_submit.html#73961
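
As a rough sketch of how either option could look in a submit file (the executable, arguments, and file names here are made up for illustration, and it assumes Exonerate or your wrapper script exits non-zero when a run fails, which you should verify first):

```
# Hypothetical submit description file; all names below are placeholders.
universe   = vanilla
executable = run_exonerate.sh
arguments  = query_$(Process).fa
output     = exonerate.$(Process).out
error      = exonerate.$(Process).err
log        = exonerate.log

# Option 1: leave the queue only after a normal exit with code 0.
# Any other exit keeps the job in the queue, so it is run again.
on_exit_remove = (ExitBySignal == False) && (ExitCode == 0)

# Option 2 (alternative): put failed jobs on hold instead, fix the
# problem, then release them by hand with condor_release.
# on_exit_hold = (ExitBySignal == True) || (ExitCode != 0)

queue 10
```

Note that with on_exit_remove alone, a job that fails every time will keep being rerun, so it is worth keeping an eye on the queue.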

Best,


matt