Re: [Condor-users] criteria for non-DAG job failures?

On Wed, Jun 27, 2012 at 11:06:21AM -0400, Vlad wrote:
> Greetings,
> Condor documentation provides some details for what's considered to be a job failure for DAG submissions (e.g. http://research.cs.wisc.edu/condor/manual/v7.8/2_10DAGMan_Applications.html#SECTION003105000000000000000) and that seems to cover process exit codes.
> What about non-DAG (cluster) jobs? I use 'notification = error' and the empirical observation (using a very new v7.8 install) is that I do get emails when jobs crash as a result of SIGBUS, etc. However, if a job returns with a non-zero error code (e.g. non-zero return from main() in C/C++) there are no emails. Is it possible to change this behavior? Could this be a matter of changing the default Condor configuration or using the appropriate submit descriptor incantation?


For pool-wide configuration, you can use the following config line:

SYSTEM_PERIODIC_HOLD = ExitBySignal =?= True || ExitCode =!= 0

You could put similar lines in your submit file for per-job configuration:

on_exit_hold = ExitBySignal =?= True || ExitCode =!= 0
notification = Error
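
For context, here is a rough sketch of how those two lines might sit in a complete submit file. The executable name, the output/error/log file names, and the email address are placeholders I made up, not anything from the original question. The =?= and =!= operators are the ClassAd "is identical to" and "is not identical to" comparisons, which always yield True or False rather than UNDEFINED.

```
# Hypothetical submit description: my_program, the file names, and the
# address below are placeholders.
universe     = vanilla
executable   = my_program
output       = my_program.out
error        = my_program.err
log          = my_program.log

# At job exit, hold the job if it was killed by a signal or if it
# exited with a non-zero exit code.
on_exit_hold = ExitBySignal =?= True || ExitCode =!= 0
notification = Error
notify_user  = user@example.com

queue
```

A job held this way stays in the queue. condor_q -hold lists held jobs, and condor_release or condor_rm can then be used to rerun or discard them.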

Nathan Panike