
Re: [HTCondor-users] centrally force removal after some time even if leave_in_queue is true?



Hi Todd,

your solution seems to work: the LeaveJobInQueue classad attribute is changed [1] and correctly evaluates to false once the expiration time has passed [2]. But indeed, as Michael said, it does not really fix my problem, since the jobs are not removed from the queue (they still appear in the condor_q output).
Is this because something is misconfigured on our schedd?
If not, I guess only a cron job running "condor_rm -forcex ..." can fix the issue...
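
A minimal sketch of what such a cron job could run, assuming the 500-second threshold from [1] and a dry-run default so nothing is removed by accident. The function names build_forcex_command and purge_removed_jobs are hypothetical helpers, not part of HTCondor; only condor_rm with its -forcex and -constraint options comes from the HTCondor command line.

```python
import subprocess


def build_forcex_command(max_age_seconds):
    """Build the condor_rm argument list for jobs stuck in the X (removed) state.

    JobStatus == 3 selects removed jobs; EnteredCurrentStatus is the time the
    job entered that state. The threshold is an assumption to tune locally.
    """
    constraint = (
        "JobStatus == 3 && (time() - EnteredCurrentStatus) > "
        f"{int(max_age_seconds)}"
    )
    return ["condor_rm", "-forcex", "-constraint", constraint]


def purge_removed_jobs(max_age_seconds, dry_run=True):
    """Print the command (dry run) or execute it on the schedd host."""
    cmd = build_forcex_command(max_age_seconds)
    if dry_run:
        # Only show what would be executed.
        print(" ".join(cmd))
    else:
        # Raises CalledProcessError if condor_rm exits with a non-zero status.
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    purge_removed_jobs(500)
```

Called with dry_run=False from cron on the schedd host, this would force out every X state job older than the threshold, regardless of the user's leave_in_queue expression.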

(anyways, job-transform seems indeed very powerful)

Regards,
Andrea


[1]
[root@llrmpicream ~]# condor_q -long 217401.0|grep InQueue
LeaveJobInQueue = ( JobStatus == 3 && ( time() - EnteredCurrentStatus ) > 500 ) ? false : SubmitterLeaveJobInQueue
SubmitterLeaveJobInQueue = ( CompletionDate =?= undefined || CompletionDate == 0 || ( ( CurrentTime - CompletionDate ) < 1800 ) )
[root@llrmpicream ~]# condor_q -constraint 'JobStatus ==3 && !LeaveJobInQueue'

[2]
-- Schedd: llrmpicream.in2p3.fr : <134.158.132.244:9125> @ 11/07/18 17:31:33
OWNER    BATCH_NAME                         SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
cmspilot CMD: CREAM467257994_jobWrapper.sh  11/7  16:21      _      _      _      1 217399.0
cmspilot CMD: CREAM078494348_jobWrapper.sh  11/7  16:21      _      _      _      1 217400.0
ops001   CMD: CREAM155506574_jobWrapper.sh  11/7  16:24      _      _      _      1 217401.0
ops000   CMD: CREAM056514266_jobWrapper.sh  11/7  16:42      _      _      _      1 217405.0

4 jobs; 0 completed, 4 removed, 0 idle, 0 running, 0 held, 0 suspended
[root@llrmpicream ~]#




On 31/10/2018 16:32, Todd Tannenbaum wrote:
On 10/31/2018 5:49 AM, Andrea Sartirana wrote:
Hi,

much is in the title.
I was wondering if there is a way to force removal from the queue of the
X state jobs after some centrally defined time even if the
leave_in_queue expression given by the user at submission still
evaluates to true. I'm running 8.6.0, vanilla universe, direct submission.

I've tried to include garbage collection of the removed jobs in
SYSTEM_PERIODIC_REMOVE, but this does not seem to have the desired effect.

Regards
Andrea

Hi Andrea,

There may be an easier way, but a quick thought is that you could use Job Transforms to accomplish the above. Job Transforms allow you, the administrator, to edit job classads upon submission; see this section of the v8.6 manual:

   http://htcondor.org/manual/v8.6/3_7Policy_Configuration.html#38930

So the idea here is to configure your schedd to edit the user's leave_in_queue expression (which ends up in the job classad as attribute LeaveJobInQueue) so that it will always evaluate to False for X state jobs after a specified amount of time, else fall back to whatever the user wanted.

Try appending the below to the HTCondor configuration (it will be used by your submit machines, and ignored on machines not running a schedd) to allow jobs in X state to leave the queue after 120 seconds regardless of what the user's submit file says:

JOB_TRANSFORM_NAMES = $(JOB_TRANSFORM_NAMES) LeaveInQueue
JOB_TRANSFORM_LeaveInQueue @=end
[
     copy_LeaveJobInQueue = "SubmitterLeaveJobInQueue";
     set_LeaveJobInQueue = (JobStatus == 3 && (time() - EnteredCurrentStatus) > 120) ? False : SubmitterLeaveJobInQueue
]
@end

Warning - the above is off the top of my head, I did not test it.
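
To make the intent of the set_LeaveJobInQueue expression concrete, here is a rough Python model of how it is meant to evaluate. This is only an illustration under simplifying assumptions: leave_job_in_queue is a hypothetical helper, the submitter's expression is reduced to an already-evaluated boolean, and ClassAd undefined/error semantics are not modeled.

```python
REMOVED = 3  # JobStatus value for jobs in the X (removed) state


def leave_job_in_queue(job_status, entered_current_status, now,
                       submitter_value, timeout=120):
    """Mirror of:
    (JobStatus == 3 && (time() - EnteredCurrentStatus) > timeout)
        ? False : SubmitterLeaveJobInQueue
    """
    if job_status == REMOVED and (now - entered_current_status) > timeout:
        # Removed long enough ago: override whatever the user asked for.
        return False
    # Otherwise fall back to the user's original leave_in_queue result.
    return submitter_value


# A job removed 121 seconds ago no longer asks to stay in the queue.
print(leave_job_in_queue(3, 1000, 1121, True))
# A job removed exactly 120 seconds ago still defers to the user's expression.
print(leave_job_in_queue(3, 1000, 1120, True))
```

The comparison is strict, so the override only applies once more than the configured number of seconds has elapsed since the job entered the X state.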

Seems like HTCondor would benefit from a SYSTEM_LEAVE_IN_QUEUE knob to make doing the above simpler.  But Job Transforms are a pretty powerful generic tool.

Hope the above helps.

regards,
Todd