
Re: [Condor-users] Automate removal of inefficient jobs



That's good stuff. Remember you can try it out by just running...

condor_q -constraint '(JobStatus == 2) && (CurrentTime - EnteredCurrentStatus > 3600) && ((RemoteSysCpu + RemoteUserCpu) < 61)'

...to get a list of jobs that would be removed.

Best,


matt

On 07/12/2011 11:31 AM, Sarah Williams wrote:
Hi Ian,

Thanks, I will start from what you've suggested and let you know how it
goes.  One thing I am unclear on: by "current run", do you mean a job that
has been held and then restarted?

--Sarah

On 7/12/11 11:20 AM, Ian Chesal wrote:
On Tuesday, July 12, 2011 at 11:05 AM, Sarah Williams wrote:
Hi Condor users & experts,

I have one user on my cluster whose jobs are usually well-behaved, but
sometimes stall on contacting a remote server. I manually kill those
jobs when I notice them, but I'd like to get that automated. The
typical sign of a stalled job is one that has >1hr of walltime and
<1min of cputime.

Is there a way to have condor automatically remove these jobs?

This is a touch tricky. I'm not sure how you get cumulative user + sys
CPU for a job for *just* the current run. But you can see the cumulative
user + sys CPU numbers for all the times a job has run. Plus the
wallclock time for all runs.

So if you wanted to do this for just a particular job, you'd add to its
submit ticket something like:

periodic_remove = (JobStatus == 2) && (CurrentTime - EnteredCurrentStatus > 3600) && ((RemoteSysCpu + RemoteUserCpu) < 61)

Or:

periodic_remove = (RemoteWallClockTime > 3600) && ((RemoteSysCpu + RemoteUserCpu) < 61)

The first one says: remove this job if it has currently been running for
more than one hour and the total sys+user CPU time it has managed to
accumulate across all its run attempts is less than 61 seconds.

The second one says: remove this job if it's accumulated more than an
hour of remote run time but has less than 61 seconds of remote sys+user
CPU time.

They're slightly different but mostly what you want.
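
As a rough sketch of where the expression goes, a complete submit file
using the first variant might look like the following. The executable and
file names (my_job.sh and friends) are hypothetical placeholders, not
anything from this thread:

```text
# Sketch of a submit description file. "my_job.sh" and the output,
# error, and log file names are hypothetical placeholders.
universe   = vanilla
executable = my_job.sh
output     = my_job.out
error      = my_job.err
log        = my_job.log

# Remove the job if its current run has lasted more than an hour but its
# cumulative sys+user CPU time (across all runs) is under 61 seconds.
periodic_remove = (JobStatus == 2) && ((CurrentTime - EnteredCurrentStatus) > 3600) && ((RemoteSysCpu + RemoteUserCpu) < 61)

queue
```

Swapping in the RemoteWallClockTime variant only changes the
periodic_remove line. The expression should stay on a single line.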

If you wanted these settings to apply to all jobs submitted to a
scheduler, you could add:

SYSTEM_PERIODIC_REMOVE = <expression>

to the condor_config.local for the scheduler machine and reconfigure the
scheduler. Then all jobs submitted to that scheduler would be subject to
this removal expression.
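
For illustration, a sketch of that entry with the first expression from
above filled in. The thresholds are simply the ones discussed in this
thread, not a recommendation:

```text
# condor_config.local on the scheduler machine.
# Thresholds (3600 s of run time, 61 s of CPU) are taken from this thread.
SYSTEM_PERIODIC_REMOVE = (JobStatus == 2) && ((CurrentTime - EnteredCurrentStatus) > 3600) && ((RemoteSysCpu + RemoteUserCpu) < 61)
```

Running condor_reconfig on that machine afterwards should get the
scheduler to pick up the change. It is probably worth checking what the
expression would match with condor_q -constraint first.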

- Ian