On Mar 7, 2008, at 2:27 PM, Craig Bruce wrote:
>> the on_exit_remove or on_exit_hold can trap this and place it on hold for you to deal with.
>
> I can't use either of these, as the job never gets as far as exiting; it just goes back to idle and will resubmit to get the same error, ad infinitum. The exit code is 1, an abnormal termination, so I tried this in on_exit_hold and periodic_hold, but the first doesn't run and the second runs before the exit code is defined. Is there something like on_evict_hold? I couldn't find anything in the manual.
Condor evaluates on_exit_hold and on_exit_remove only when the job completes and is ready to leave the queue. Since Condor leaves the job in the queue on OutOfMemory, the on_exit expressions are never evaluated.
Here's how you can use periodic_hold:

    periodic_hold = NumJobStarts =!= Undefined && NumJobStarts > 2

The first half of the expression is required because NumJobStarts isn't defined in the job ad until the job starts running for the first time.
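To show where the expression goes, a minimal submit description file using it might look like the sketch below. Only the periodic_hold line comes from this thread; the universe, the executable name (my_job.sh) and the threshold comment are placeholders assumed for illustration.

```
# Hypothetical submit description file; my_job.sh is a placeholder.
universe   = vanilla
executable = my_job.sh

# Put the job on hold once it has been started more than twice.
# The =!= Undefined check guards the first evaluation, before the
# job has ever run and NumJobStarts exists in the job ad.
periodic_hold = NumJobStarts =!= Undefined && NumJobStarts > 2

queue
```

Once held, the job stays in the queue until you release or remove it, so you can inspect it instead of letting it cycle back to idle.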
This will catch jobs that re-execute for other reasons as well, but it will stop infinite re-execution.
Thanks and regards,
Jaime Frey
UW-Madison Condor Team