
Re: [Condor-users] Automate removal of inefficient jobs



Hi Ian,

Thanks, I will start from what you've suggested and let you know how it
goes.  One thing I am unclear on: by "current run", do you mean a job that
has been held and then restarted?

--Sarah

On 7/12/11 11:20 AM, Ian Chesal wrote:
> On Tuesday, July 12, 2011 at 11:05 AM, Sarah Williams wrote:
>> Hi Condor users & experts,
>>
>> I have one user on my cluster whose jobs are usually well-behaved, but
>> sometimes stall on contacting a remote server. I manually kill those
>> jobs when I notice them, but I'd like to get that automated. The
>> typical sign of a stalled job is one that has >1hr of walltime, and
>> <1min of cputime.
>>
>> Is there a way to have condor automatically remove these jobs?
> This is a touch tricky. I'm not sure how you get cumulative user + sys
> CPU for a job for *just* the current run. But you can see the cumulative
> user + sys CPU numbers for all the times a job has run. Plus the
> wallclock time for all runs.
> 
> So if you wanted to do this for just a particular job, you'd add to its
> submit ticket something like:
> 
> periodic_remove = (JobStatus == 2) && (CurrentTime -
> EnteredCurrentStatus > 3600) && ((RemoteSysCpu + RemoteUserCpu) < 61)
> 
> Or:
> 
> periodic_remove = (RemoteWallClockTime > 3600)  && ((RemoteSysCpu +
> RemoteUserCpu) < 61)
> 
> The first one says: remove this job if it has currently been running for
> more than one hour and the total sys+user CPU time it has managed to
> accumulate across all its run attempts is less than 61 seconds.
> 
> The second one says: remove this job if it's accumulated more than an
> hour of remote run time but has less than 61 seconds of remote sys+user
> CPU time.
> 
> They're slightly different but mostly what you want.
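> 
> For context, here is a minimal sketch of where the first expression
> might sit in a submit file. The file name, executable, and output/log
> names are hypothetical placeholders, not anything from the original
> question; only the periodic_remove line comes from the text above. The
> expression is written on one line to avoid any wrapping issues.
> 
> ```
> # stalled.sub: hypothetical submit description file
> universe   = vanilla
> executable = fetch_data.sh
> output     = fetch_data.$(Cluster).$(Process).out
> error      = fetch_data.$(Cluster).$(Process).err
> log        = fetch_data.log
> 
> # Remove the job if it has been in the running state (JobStatus == 2)
> # for more than 3600 seconds while its cumulative sys+user CPU time
> # is still under 61 seconds.
> periodic_remove = (JobStatus == 2) && (CurrentTime - EnteredCurrentStatus > 3600) && ((RemoteSysCpu + RemoteUserCpu) < 61)
> 
> queue
> ```
> 
> Before relying on it, one way to preview which queued jobs currently
> match would be to pass the same expression to condor_q -constraint.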
> 
> If you wanted these settings to apply to all jobs submitted to a
> scheduler you could add:
> 
> SYSTEM_PERIODIC_REMOVE = <expression>
> 
> To the condor_config.local for the scheduler machine and reconfigure the
> scheduler. Then all jobs submitted to that scheduler would be subject to
> this removal expression.
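> 
> As a rough sketch, assuming you want the first expression applied
> pool-wide, the entry in condor_config.local could look like the
> following. The thresholds are the same illustrative values used above
> and would need tuning for your workload.
> 
> ```
> # condor_config.local on the scheduler machine
> # Applies to every job in this scheduler's queue, not just one user's.
> SYSTEM_PERIODIC_REMOVE = (JobStatus == 2) && (CurrentTime - EnteredCurrentStatus > 3600) && ((RemoteSysCpu + RemoteUserCpu) < 61)
> ```
> 
> Running condor_reconfig on that machine afterwards should make the
> scheduler pick up the change. Because this affects all users, it may be
> worth trying the expression on a single job first.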
> 
> - Ian
> 
> ---
> Ian Chesal
> 
> Cycle Computing, LLC
> Leader in Open Compute Solutions for Clouds, Servers, and Desktops
> Enterprise Condor Support and Management Tools
> 
> http://www.cyclecomputing.com
> http://www.cyclecloud.com
> http://twitter.com/cyclecomputing 
> 
> 
> 