
Re: [Condor-users] log file indicates termination of job, but output file is empty !?!



 
Hi Rob

Ah yes, the old -10737*** error. Try googling it (without the -).
It is a sporadic Windows error (unrelated to Condor) that some machines
return one day but not the next. Unfortunately, such machines can snaffle
a lot of jobs: each job fails quickly, the machine is immediately ready to
accept another job, which also fails, and so on. A remedy we have used is
to include this in the submit file:

on_exit_remove = (ExitCode == 0) 

This means that the job is removed from the queue only if the ExitCode is
zero. For any other value (e.g. -107****) it is NOT removed; it is requeued
and will hopefully execute properly on a different machine this time. Of
course this happens for anything nonzero, so you need to be careful if your
code exits non-zero for any other reason (e.g. file not found, etc.). I
guess you could check for exactly -1073741502, maybe something like

on_exit_remove = (ExitCode =!= -1073741502) 

Note that we have seen two similar but slightly different -10737* type
error numbers.
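
If you want something easier to search for than the negative decimal, the number can be rewritten as an unsigned 32-bit hex value, which is how Windows status codes are usually listed. The snippet below is only a small sketch of that arithmetic; the function name to_ntstatus_hex is made up for illustration and is not part of Condor. As far as I know, the resulting 0xC0000142 is the code Windows documents as STATUS_DLL_INIT_FAILED, but do check that against your own machines.

```python
def to_ntstatus_hex(exit_code: int) -> str:
    """Reinterpret a signed 32-bit exit code as unsigned and format it as hex.

    Hypothetical helper for illustration only; not part of Condor.
    """
    # Masking with 0xFFFFFFFF maps a negative value to its 32-bit
    # two's-complement unsigned equivalent.
    return f"0x{exit_code & 0xFFFFFFFF:08X}"


# The exit status reported in the log quoted below.
print(to_ntstatus_hex(-1073741502))  # 0xC0000142
```

The same conversion can be applied to any other -10737* value you see, to check whether it is a different status code.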

Cheers

Greg



>Thank you. I have tested again using the error entry in the submit file.
>I found that both the error and output files are empty when this problem 
>recurred.
>The log file tells me that the job has terminated.
>
>This is what I think is relevant in the log files:
>(Notice the exit status "-1073741502"; what does that mean?)
>
>ShadowLog:11/25 12:55:13 Initializing a VANILLA shadow for job 322.1464
>ShadowLog:11/25 12:55:13 (322.1464) (19248): Request to run on slot1@34-5 
><115.125.128.213:1045> was ACCEPTED