
Re: [HTCondor-users] [Condor-users] Job Resubmit



On 6/30/2014 2:24 AM, Sunshine wrote:
> I submitted some jobs.
> A few of the jobs took 2 hours to complete, but I think they should take about 20 minutes, and some similar jobs did indeed finish within 20 minutes.
> I think something is wrong with my jobs or my cluster.
> 
> 
> My question: how do I let a job restart after a specific time?
> For example, if a job didn't finish within 5 minutes, then let the job resubmit, or restart on a different machine?
> 

See the submit file below, which I wrote as an example of how to do this.
Hopefully it is self-explanatory, especially if you look at the condor_submit
manual page for the expressions I used - see
  http://research.cs.wisc.edu/htcondor/manual/current/condor_submit.html
Feel free to ask any questions.
Hope this helps,
Todd

# Fill in executable and max expected runtime in minutes.
# If the job runs longer than expected, it will go on hold,
# and then will be restarted on a different machine.  After
# three runs on three different machines, the job will
# stay on hold.
#
executable = foo
expected_runtime_minutes = 5
#
# Should not need to change the below...
#
job_machine_attrs = Machine
job_machine_attrs_history_length = 4
requirements = target.machine =!= MachineAttrMachine1 && \
   target.machine =!= MachineAttrMachine2 && \
   target.machine =!= MachineAttrMachine3
periodic_hold = JobStatus == 2 && \
   CurrentTime - EnteredCurrentStatus > 60 * $(expected_runtime_minutes)
periodic_hold_subcode = 1
periodic_release = HoldReasonCode == 3 && HoldReasonSubCode == 1 && \
   JobRunCount < 3
periodic_hold_reason = ifthenelse(JobRunCount<3,"Ran too long, will retry","Ran too long")
queue
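
To see how the hold and release expressions interact, here is a small Python sketch that walks one job through the policy. It is only a model, not anything HTCondor runs: the function name simulate and its arguments are made up for illustration, it assumes JobRunCount goes up by one each time the job starts, and it ignores the fact that the periodic expressions are only evaluated every so often.

```python
def simulate(expected_runtime_minutes, runtimes_minutes, max_runs=3):
    """Model the periodic_hold / periodic_release policy for one job.

    runtimes_minutes lists how long the job would need on each machine it
    lands on, in order (hypothetical input, for illustration only).
    Returns (job_run_count, final_state).
    """
    limit_seconds = 60 * expected_runtime_minutes
    job_run_count = 0
    for runtime in runtimes_minutes:
        job_run_count += 1  # assumed: JobRunCount increments on each start
        if 60 * runtime <= limit_seconds:
            # periodic_hold never becomes true, so the job runs to completion
            return job_run_count, "completed"
        # periodic_hold fired: the job ran longer than the limit
        if job_run_count >= max_runs:
            # periodic_release requires JobRunCount < 3, so the job stays held
            return job_run_count, "held"
        # otherwise periodic_release fires and the job is matched again,
        # with requirements steering it away from machines already tried
    return job_run_count, "released, waiting to rerun"


if __name__ == "__main__":
    # Slow first machine (120 minutes), then a normal one (4 minutes)
    print(simulate(5, [120, 4]))
    # Too slow everywhere: the job stays on hold after the third run
    print(simulate(5, [120, 120, 120, 1]))
```

Under these assumptions the job starts at most three times, which is why job_machine_attrs_history_length only needs to remember a handful of machines.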