
[Condor-users] OnExitRemove expression '(ExitCode == 0)' puts job on Hold when exit-by-signal...



Hi,

I have jobs being put on hold in the queue (i.e. they show an "H" in the condor_q output).
After a while I figured out that this is caused by a combination of two things:

1) Checking the exit status of the job in the Condor submit file:
      on_exit_remove = (ExitCode == 0)

2) Having the program exit with a signal, e.g. a segfault.

Here is an example:

int main() {
 /* this code intentionally causes a crash in the loop */
 int i, *d; for (i = 0; i < 1000000; ++i) d[i] = i*i;
 return 0;
}

When I compile this code (with mingw32 on a Linux system) and submit it to a WinXP pool PC, the job is put on hold, and the tail of the XML log file shows:

    <a n="HoldReasonSubCode"><i>5</i></a>
    <a n="HoldReason"><s>The job attribute OnExitRemove expression '( ExitCode == 0 )' evaluated to UNDEFINED</s></a>
    <a n="Proc"><i>0</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="CurrentTime"><e>time()</e></a>


I understand that it makes perfect sense not to resubmit this job, as it would probably crash again and again.
However, could Condor be made a bit more informative here about why the expression "evaluated to UNDEFINED"?
It took me a while to figure out that the exit-by-signal had caused this "UNDEFINED" result and consequently put the job on hold.
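
For anyone hitting the same thing, a guard on ExitBySignal might avoid the UNDEFINED result. This is only an untested sketch: it assumes the job ad carries an ExitBySignal attribute once the job has terminated, which is my reading of the condor_submit manual, and I have not checked it on this pool.

```
# Sketch, not verified on my setup. Use only one of the two lines.

# Variant A: leave the queue only on a normal exit with code 0.
on_exit_remove = (ExitBySignal == False) && (ExitCode == 0)

# Variant B: also leave the queue when the job was killed by a signal.
on_exit_remove = (ExitBySignal == True) || (ExitCode == 0)
```

As far as I understand the ClassAd rules, FALSE && UNDEFINED gives FALSE and TRUE || UNDEFINED gives TRUE, so neither variant should evaluate to UNDEFINED after a crash. With variant A a crashed job would stay in the queue and be run again; variant B would let it leave the queue, which is closer to what I want for a job that would only crash again.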

Or am I missing some Condor essentials on this topic? Or is this already documented elsewhere?

Thank you.
Rob.