
[HTCondor-users] Capping job run time with PERIODIC_REMOVE



Hi, my team and I are using condor 8.6.1 and are trying to cap the runtime of jobs that may loop forever for any number of reasons, such as bugs in the executables our application is built on top of or HTTP requests that never time out.  I have gotten these statements to work for the most part but have come across a corner case.  Say my condor submit file has a PERIODIC_REMOVE statement as follows:

((JobStatus==2) && (CurrentTime - EnteredCurrentStatus) > 60)
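
For context, in the submit file this sits roughly as in the sketch below.  Only the periodic_remove line reflects what we actually use; the universe, the executable name, and the rest of the file are placeholders for illustration.

```
# Hypothetical submit file; everything except periodic_remove is a placeholder
universe        = vanilla
executable      = run_task.sh

# Remove the job once it has been in the Running state (JobStatus == 2)
# for more than 60 seconds
periodic_remove = (JobStatus == 2) && ((CurrentTime - EnteredCurrentStatus) > 60)

queue
```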

 

When running this job, if the job terminates successfully after the allotted 60 seconds but before condor has had a chance to remove it, the DAG treats the node as failed.  The code we run on these machines relies on receiving the SIGTERM sent by condor_rm in order to report the timeout correctly to a REST server, but in this situation the code has already terminated before condor could send it.

I have looked into the condor configuration and found the settings PERIODIC_EXPR_INTERVAL, MAX_PERIODIC_EXPR_INTERVAL, and PERIODIC_EXPR_TIMESLICE, but I also saw the note that evaluating the periodic expressions for a queue with many jobs can be costly and can degrade the performance of the job scheduling machine.  I can see situations where our queue holds up to 1000 running jobs with 1000 more sitting idle in the backlog waiting for execute machines.  My hope is that there is some way to adjust these configuration variables to tighten the removal interval, so that the code cannot finish executing before it receives the SIGTERM, but I do not want to do that at the risk of degrading scheduler performance.  Could anyone offer some suggestions?  Thank you
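
P.S. For reference, these are the three settings I mean, as they would appear in the configuration on the scheduling machine.  The values shown are the defaults as I understand them from the manual, not settings we have tested, so please treat this as a sketch of where the knobs live rather than a recommendation.

```
# Minimum number of seconds between evaluations of periodic expressions
# (periodic_remove, periodic_hold, periodic_release)
PERIODIC_EXPR_INTERVAL = 60

# Upper bound, in seconds, on how far apart the evaluations may drift
# when they take a long time to run
MAX_PERIODIC_EXPR_INTERVAL = 1200

# Maximum fraction of its time the schedd should spend evaluating
# periodic expressions
PERIODIC_EXPR_TIMESLICE = 0.01
```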