
Re: [Condor-users] Job is getting rerun instead of terminated



On Jul 22, 2005, at 5:26 AM, Andreas Vetter wrote:

we have a setup that is meant to terminate all jobs after 12 hours 
runtime. Most jobs are vanilla universe. But sometimes there are jobs that 
are evicted after 12 hours and then started again on other nodes. The user 
finally killed the job with condor_rm. Other jobs are terminated after 12 
hours as expected.

Attached is part 3 of our global condor config and the users log for the 
restarting job.

Did I miss something? 

When an execute machine kills a job for running too long, the schedd doesn't consider the job complete. It assumes the execute machine wasn't willing to let the job run long enough, so it looks for another machine that will let the job run to completion. Whether and when a job leaves the queue is controlled by the job ad in the schedd, not by the execute machine.

If you want your jobs to leave the queue when they run longer than 12 hours, you need to set periodic_remove in the job ads. If you want the jobs to stay in the queue but not get rerun, you need to modify the startd's requirements to not run jobs that previously ran for more than 12 hours.
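
A minimal sketch of the first option, as lines in the user's submit description file. It assumes the job ad carries the usual JobStatus and EnteredCurrentStatus attributes (with JobStatus 2 meaning "running") and that CurrentTime can be referenced in the expression; check the attribute names against the manual for your Condor version before relying on it.

```
# Remove the job from the queue once its current run has lasted more than 12 hours.
# JobStatus == 2 is assumed to mean "running"; 12 * 3600 is 12 hours in seconds.
periodic_remove = (JobStatus == 2) && ((CurrentTime - EnteredCurrentStatus) > (12 * 3600))
```

As written, the expression measures only the current run, so a job that is evicted and restarted before the limit would start counting from zero again.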

+----------------------------------+---------------------------------+
|            Jaime Frey            |  Public Split on Whether        |
|        jfrey@xxxxxxxxxxx         |  Bush Is a Divider              |
|  http://www.cs.wisc.edu/~jfrey/  |         -- CNN Scrolling Banner |
+----------------------------------+---------------------------------+