
Re: [Condor-users] Automate removal of inefficient jobs



Hi all,

I used condor_q to test this statement, and it selects the correct jobs.
 condor_q -constraint ' User =?= "user1@xxxxxxxxxxxxxxx" && (JobStatus
== 2) && (CurrentTime - EnteredCurrentStatus > 3600) && ((RemoteSysCpu +
RemoteUserCpu) < 61)'

So, I set SYSTEM_PERIODIC_REMOVE equal to that value on the schedd host,
verified it with condor_config_val, and waited. But, it does not seem to
be removing the jobs.  The ScheddLog does not have any unusual entries.
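
For anyone following along, here is a rough plain-Python sketch of what that constraint tests. It is not HTCondor's ClassAd evaluator; the function name, the sample job ads, the timestamp, and the user string are all made up for illustration, and treating a missing attribute as "no match" is an assumption.

```python
# Sketch of the constraint's logic in plain Python.
# This is NOT HTCondor's ClassAd evaluator; names and data are hypothetical.

RUNNING = 2  # JobStatus value the constraint compares against


def is_stalled(ad, now, user, min_wall=3600, max_cpu=61):
    """Return True if the job ad (a plain dict) matches the constraint."""
    try:
        cpu = ad["RemoteSysCpu"] + ad["RemoteUserCpu"]
        return (
            ad["User"] == user
            and ad["JobStatus"] == RUNNING
            and now - ad["EnteredCurrentStatus"] > min_wall
            and cpu < max_cpu
        )
    except KeyError:
        # Assumption: a missing attribute is treated as "no match" here.
        return False


now = 1_310_500_000            # hypothetical CurrentTime, Unix seconds
user = "user1@example.org"     # hypothetical user string

# Two hours in the running state with 13 seconds of CPU: matches.
stalled = {
    "User": user,
    "JobStatus": 2,
    "EnteredCurrentStatus": now - 7200,
    "RemoteSysCpu": 3.0,
    "RemoteUserCpu": 10.0,
}
# Same job, but with plenty of CPU time: does not match.
busy = dict(stalled, RemoteUserCpu=5000.0)

print(is_stalled(stalled, now, user))
print(is_stalled(busy, now, user))
```

The first job ad matches and the second does not, which mirrors what the condor_q test selects.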

I tried wrapping the statement with debug(), but no debug messages are
printed to the log.  Also tried SCHEDD_DEBUG = D_FULLDEBUG D_COMMAND,
but there were no messages about periodic_remove in the output.

Tomorrow I will try setting periodic_remove per job and see if that
works ....

--Sarah

On 7/12/11 1:23 PM, Matthew Farrellee wrote:
> That's good stuff. Remember you can try it out by just running...
> 
> condor_q -constraint '(JobStatus == 2) && (CurrentTime -
> EnteredCurrentStatus > 3600) && ((RemoteSysCpu + RemoteUserCpu) < 61)'
> 
> ...to get a list of jobs that would be removed.
> 
> Best,
> 
> 
> matt
> 
> On 07/12/2011 11:31 AM, Sarah Williams wrote:
>> Hi Ian,
>>
>> Thanks, I will start from what you've suggested and let you know how it
>> goes.  One thing I am unclear on, by current run you mean a job that has
>> been held and then restarted?
>>
>> --Sarah
>>
>> On 7/12/11 11:20 AM, Ian Chesal wrote:
>>> On Tuesday, July 12, 2011 at 11:05 AM, Sarah Williams wrote:
>>>> Hi Condor users & experts,
>>>>
>>>> I have one user on my cluster whose jobs are usually well-behaved, but
>>>> sometimes stall on contacting a remote server. I manually kill those
>>>> jobs when I notice them, but I'd like to get that automated. The
>>>> typical sign of a stalled job is one that has >1hr of walltime, and
>>>> <1min of cputime.
>>>>
>>>> Is there a way to have condor automatically remove these jobs?
>>> This is a touch tricky. I'm not sure how you get cumulative user + sys
>>> CPU for a job for *just* the current run. But you can see the cumulative
>>> user + sys CPU numbers for all the times a job has run. Plus the
>>> wallclock time for all runs.
>>>
>>> So if you wanted to do this for just a particular job, you'd add to its
>>> submit ticket something like:
>>>
>>> periodic_remove = (JobStatus == 2) && (CurrentTime -
>>> EnteredCurrentStatus > 3600) && ((RemoteSysCpu + RemoteUserCpu) < 61)
>>>
>>> Or:
>>>
>>> periodic_remove = (RemoteWallClockTime > 3600) && ((RemoteSysCpu +
>>> RemoteUserCpu) < 61)
>>>
>>> The first one says: remove this job if it's currently been running for
>>> greater than one hour and the total sys+user CPU time it's managed to
>>> accumulate across all its run attempts is less than 61 seconds.
>>>
>>> The second one says: remove this job if it's accumulated more than an
>>> hour of remote run time but has less than 61 seconds of remote sys+user
>>> CPU time.
>>>
>>> They're slightly different but mostly what you want.
>>>
>>> If you wanted these settings to apply to all jobs submitted to a
>>> scheduler you could add:
>>>
>>> SYSTEM_PERIODIC_REMOVE = <expression>
>>>
>>> To the condor_config.local for the scheduler machine and reconfigure the
>>> scheduler. Then all jobs submitted to that scheduler would be subject to
>>> this removal expression.
>>>
>>> - Ian
>>>
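
The difference between the two periodic_remove variants quoted above shows up for a job that has been restarted. The plain-Python sketch below only illustrates the logic as described in the thread; it does not use any HTCondor API, and the function names, the sample job ad, and the timestamp are hypothetical.

```python
# Plain-Python sketch comparing the two quoted expressions.
# Not an HTCondor API; function names and sample values are hypothetical.

def current_run_stalled(ad, now):
    # Variant 1: over an hour in the current running state,
    # with under 61 s of cumulative sys+user CPU.
    return (
        ad["JobStatus"] == 2
        and now - ad["EnteredCurrentStatus"] > 3600
        and ad["RemoteSysCpu"] + ad["RemoteUserCpu"] < 61
    )


def cumulative_stalled(ad):
    # Variant 2: over an hour of accumulated remote wall-clock time,
    # with under 61 s of cumulative sys+user CPU.
    return (
        ad["RemoteWallClockTime"] > 3600
        and ad["RemoteSysCpu"] + ad["RemoteUserCpu"] < 61
    )


now = 1_310_500_000  # hypothetical CurrentTime, Unix seconds

# A job that accumulated 5000 s of wall-clock time over earlier runs,
# restarted 10 minutes ago, and has 20 s of CPU time in total.
restarted = {
    "JobStatus": 2,
    "EnteredCurrentStatus": now - 600,
    "RemoteWallClockTime": 5000.0,
    "RemoteSysCpu": 5.0,
    "RemoteUserCpu": 15.0,
}

print(current_run_stalled(restarted, now))
print(cumulative_stalled(restarted))
```

In this made-up case the first variant leaves the job alone because its current run is only ten minutes old, while the second would remove it because of the wall-clock time accumulated across runs.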