Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

Re: [HTCondor-users] HTCondor-CE not purging finished jobs

Date: Mon, 18 May 2020 16:49:56 -0500
From: Brian Lin <blin@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] HTCondor-CE not purging finished jobs

Hi Stefano,

In addition to the time difference in the system periodic remove vs your queries, you're also comparing two different attributes against the current time: EnteredCurrentStatus and x509UserProxyExpiration, respectively. I'm curious what you see from the following query:

# condor_ce_q -cons '(JobStatus == 5 ) && (time() - EnteredCurrentStatus > 2 * 3600)'


Could you use that list of jobs for your analysis of the SchedLog?
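
To illustrate why the two conditions can select different jobs, here is a minimal sketch that uses plain Python dicts as stand-ins for job ads. The attribute names come from the queries in this thread; the job values and the helper names held_longer_than and proxy_expired_longer_than are made up for illustration and are not part of any HTCondor API.

```python
HELD = 5  # JobStatus value used for held jobs in the queries in this thread


def held_longer_than(ad, now, hours):
    """Mirror of: JobStatus == 5 && time() - EnteredCurrentStatus > hours * 3600"""
    return ad["JobStatus"] == HELD and now - ad["EnteredCurrentStatus"] > hours * 3600


def proxy_expired_longer_than(ad, now, hours):
    """Mirror of: JobStatus == 5 && time() - x509UserProxyExpiration > hours * 3600"""
    return ad["JobStatus"] == HELD and now - ad["x509UserProxyExpiration"] > hours * 3600


now = 1_000_000  # arbitrary epoch seconds, hypothetical

# Hypothetical job ads: "a" was held recently but its proxy expired long ago,
# "b" has been held for a while but its proxy expired only recently.
jobs = {
    "a": {"JobStatus": HELD,
          "EnteredCurrentStatus": now - 1 * 3600,
          "x509UserProxyExpiration": now - 20 * 3600},
    "b": {"JobStatus": HELD,
          "EnteredCurrentStatus": now - 3 * 3600,
          "x509UserProxyExpiration": now - 1 * 3600},
}

by_hold_time = sorted(j for j, ad in jobs.items() if held_longer_than(ad, now, 2))
by_proxy_age = sorted(j for j, ad in jobs.items() if proxy_expired_longer_than(ad, now, 16))
print(by_hold_time, by_proxy_age)
```

In this made-up case the two lists do not overlap at all, which is why a job list built from x509UserProxyExpiration is not necessarily the list the remove policy acts on.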

Thanks,
Brian

On 5/18/20 4:27 PM, Stefano Dal Pra wrote:

Hi Brian,
sorry for the 8 hours vs 4 hours confusion. Jobs stay there much longer anyway.
I have set the debug level as you said (on a brand new CE working with "ops" jobs only until now). I also reduced the remove policy to 2 hours (to be sure there is something to purge). Before reconfiguring I selected the job IDs on hold for more than 16 hours and found 12 such jobs:
[root@ce01-lhcb-t2 ~]# condor_ce_q -cons '(JobStatus == 5 ) && (time() - x509UserProxyExpiration > 16 * 3600)'
-- Schedd: ce01-lhcb-t2.cr.cnaf.infn.it : <131.154.192.120:28493> @ 05/18/20 22:27:32
OWNER  BATCH_NAME    SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS
ops046 ID: 66       5/16 21:35      _      _      _      1      1 66.0
ops046 ID: 67       5/16 23:35      _      _      _      1      1 67.0
ops046 ID: 68       5/17 01:35      _      _      _      1      1 68.0
ops046 ID: 69       5/17 03:35      _      _      _      1      1 69.0
ops046 ID: 70       5/17 05:35      _      _      _      1      1 70.0
ops046 ID: 71       5/17 07:35      _      _      _      1      1 71.0
ops046 ID: 72       5/17 09:35      _      _      _      1      1 72.0
ops046 ID: 73       5/17 11:35      _      _      _      1      1 73.0
ops046 ID: 74       5/17 13:35      _      _      _      1      1 74.0
ops046 ID: 75       5/17 15:35      _      _      _      1      1 75.0
ops046 ID: 76       5/17 17:35      _      _      _      1      1 76.0
ops046 ID: 77       5/17 19:35      _      _      _      1      1 77.0
After a condor_ce_reconfig (I actually did a restart too) a few of them are gone, and a few are still there:
[root@ce01-lhcb-t2 ~]# condor_ce_q 66.0 67.0 68.0 69.0 70.0 71.0 72.0 73.0 74.0 75.0 76.0 77.0
-- Schedd: ce01-lhcb-t2.cr.cnaf.infn.it : <131.154.192.120:30149> @ 05/18/20 22:58:16
OWNER  BATCH_NAME    SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS
ops046 ID: 68       5/17 01:35      _      _      _      1      1 68.0
ops046 ID: 69       5/17 03:35      _      _      _      1      1 69.0
ops046 ID: 72       5/17 09:35      _      _      _      1      1 72.0
ops046 ID: 73       5/17 11:35      _      _      _      1      1 73.0
ops046 ID: 76       5/17 17:35      _      _      _      1      1 76.0
ops046 ID: 77       5/17 19:35      _      _      _      1      1 77.0
Total for query: 6 jobs; 0 completed, 0 removed, 0 idle, 0 running, 6 held, 0 suspended
Total for all users: 15 jobs; 5 completed, 0 removed, 0 idle, 0 running, 10 held, 0 suspended
The SchedLog after reconfig has, for job 71.0 (this has been removed):
05/18/20 22:33:05 (D_ALWAYS:2) abort_job_myself: 71.0 action:Remove log_hold:true
05/18/20 22:33:05 (D_ALWAYS:2) Cleared dirty attributes for job 71.0
05/18/20 22:33:05 (D_ALWAYS:2) Writing record to user logfile=/var/lib/condor-ce/spool/71/0/cluster71.proc0.subproc0/gridjob.log owner=ops046
05/18/20 22:33:05 (D_ALWAYS:2) WriteUserLog::initialize: opened /var/lib/condor-ce/spool/71/0/cluster71.proc0.subproc0/gridjob.log successfully
05/18/20 22:33:05 (D_ALWAYS:2) WriteUserLog::user_priv_flag (~) is 0
05/18/20 22:33:05 (D_ALWAYS) Job 71.0 aborted: CE job removed by SYSTEM_PERIODIC_REMOVE due to being in the hold state for 2 hours.
Looking for job 68.0 (not removed), however, there is nothing after the reconfiguration time (22:30):
05/18/20 21:30:34 (D_ALWAYS:2) abort_job_myself: 68.0 action:Hold log_hold:true
05/18/20 21:30:34 (D_ALWAYS:2) Cleared dirty attributes for job 68.0
05/18/20 21:30:34 (D_ALWAYS:2) Writing record to user logfile=/var/lib/condor-ce/spool/68/0/cluster68.proc0.subproc0/gridjob.log owner=ops046
05/18/20 21:30:34 (D_ALWAYS:2) WriteUserLog::initialize: opened /var/lib/condor-ce/spool/68/0/cluster68.proc0.subproc0/gridjob.log successfully
05/18/20 21:30:34 (D_ALWAYS:2) WriteUserLog::user_priv_flag (~) is 0
05/18/20 21:30:34 (D_ALWAYS:2) SelfDrainingQueue act_on_job_myself_queue is empty, not resetting timer
And the job is still there:
[root@ce01-lhcb-t2 ~]# condor_ce_q 68.0
-- Schedd: ce01-lhcb-t2.cr.cnaf.infn.it : <131.154.192.120:30149> @ 05/18/20 23:23:49
OWNER  BATCH_NAME    SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS
ops046 ID: 68       5/17 01:35      _      _      _      1      1 68.0
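
The per-job SchedLog check described above (look for entries mentioning a job ID after the reconfiguration time) can be sketched roughly as follows. This is only an illustration: the helper lines_for_job is hypothetical, the sample lines are adapted from the excerpts in this message, and the timestamp layout is assumed to be the one shown there.

```python
from datetime import datetime

TS_FORMAT = "%m/%d/%y %H:%M:%S"  # layout assumed from the excerpts in this thread
TS_LENGTH = 17                    # len("05/18/20 22:33:05")


def lines_for_job(log_lines, job_id, since):
    """Return log lines that mention job_id and are timestamped at or after `since`."""
    matches = []
    for line in log_lines:
        try:
            stamp = datetime.strptime(line[:TS_LENGTH], TS_FORMAT)
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        # Pad with spaces so that "71.0" does not also match "171.0".
        if stamp >= since and f" {job_id} " in f" {line} ":
            matches.append(line)
    return matches


# Sample lines adapted from the SchedLog excerpts in this message.
sample = [
    "05/18/20 21:30:34 (D_ALWAYS:2) abort_job_myself: 68.0 action:Hold log_hold:true",
    "05/18/20 22:33:05 (D_ALWAYS:2) abort_job_myself: 71.0 action:Remove log_hold:true",
    "05/18/20 22:33:05 (D_ALWAYS) Job 71.0 aborted: CE job removed by "
    "SYSTEM_PERIODIC_REMOVE due to being in the hold state for 2 hours.",
]

reconfig_time = datetime(2020, 5, 18, 22, 30)
for job in ("68.0", "71.0"):
    print(job, len(lines_for_job(sample, job, reconfig_time)))
```

Lines that only carry the job number inside a spool path (such as the "Writing record to user logfile" entries) are not matched by this simple test, so it undercounts slightly; it is enough to tell a job with post-reconfig activity from one without.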

Cheers
Stefano



On 18/05/20 22:07, Brian Lin wrote:
Hi Stefano,
I'm a little confused: your system periodic remove expressions seem to remove jobs that have been held for more than 8 hours, whereas your queries are looking for held jobs whose proxies have been expired for more than 4 hours. I imagine there's some overlap but they seem like fairly different queries.
Though having the RemoveReason set that way is pretty strange. If you set "SCHEDD_DEBUG = D_CAT D_ALWAYS:2", you may see some hints in the SchedLog as to why the Schedd is failing to remove these jobs.
Thanks,
Brian

On 5/16/20 11:05 AM, Stefano Dal Pra wrote:
Hello,
htcondor-ce-3.4.0-1.el7.noarch here.

We have a problem common to all of our CEs:
[root@ce02-htc ~]# condor_ce_q -cons '(JobStatus == 5 ) && (time() - x509UserProxyExpiration > 4 * 3600)' -af Owner | sort | uniq -c
   9592 user1
      4 user2
   1114 user3
    575 user4
     44 user5

I have set up REMOVE and REMOVE_REASON rules:
SYSTEM_PERIODIC_REMOVE = (JobStatus == 5 && CurrentTime - EnteredCurrentStatus > 3600*8)
SYSTEM_PERIODIC_REMOVE_REASON = strcat("CE job removed by SYSTEM_PERIODIC_REMOVE due to ", ifThenElse((JobStatus == 5 && CurrentTime - EnteredCurrentStatus > 3600*8), "being in the hold state for 8 hours.", ifThenElse((JobStatus == 5 && isUndefined(RoutedToJobId)), "non-existent route or entry in JOB_ROUTER_ENTRIES.", "input files missing." ) ) )
These "non-purged" jobs have a RemoveReason set, but they are still in the queue:
[root@ce02-htc ~]# condor_ce_q 1679707.0 -af JobStatus RemoveReason
5 CE job removed by SYSTEM_PERIODIC_REMOVE due to being in the hold state for 8 hours.
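
For readers unfamiliar with nested ifThenElse, the branch order of the REMOVE_REASON expression can be mirrored in a few lines of Python. This is a rough sketch only: remove_reason is a hypothetical helper, the job ads are made-up dicts, and ClassAd semantics for undefined values are approximated with a simple key lookup.

```python
HELD = 5  # JobStatus value used for held jobs in this thread


def remove_reason(ad, now):
    """Mirror the branch order of the nested ifThenElse in the reason expression."""
    is_held = ad.get("JobStatus") == HELD
    if is_held and now - ad["EnteredCurrentStatus"] > 3600 * 8:
        detail = "being in the hold state for 8 hours."
    elif is_held and "RoutedToJobId" not in ad:  # stand-in for isUndefined(RoutedToJobId)
        detail = "non-existent route or entry in JOB_ROUTER_ENTRIES."
    else:
        detail = "input files missing."
    return "CE job removed by SYSTEM_PERIODIC_REMOVE due to " + detail


now = 1_000_000  # arbitrary epoch seconds, hypothetical

# Hypothetical job ads exercising each branch.
held_9h = {"JobStatus": HELD, "EnteredCurrentStatus": now - 9 * 3600}
held_1h_unrouted = {"JobStatus": HELD, "EnteredCurrentStatus": now - 3600}
held_1h_routed = {"JobStatus": HELD, "EnteredCurrentStatus": now - 3600, "RoutedToJobId": "12.0"}

for ad in (held_9h, held_1h_unrouted, held_1h_routed):
    print(remove_reason(ad, now))
```

The first branch repeats the SYSTEM_PERIODIC_REMOVE condition itself, so with this policy a job removed by it would be expected to carry the "hold state for 8 hours" text, as in the output above.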
Until now I have found no better way than removing these jobs manually, using something like:
condor_ce_q -cons '(JobStatus == 5 ) && (time() - x509UserProxyExpiration > 4 * 3600)' -af 'strcat(ClusterId,".",ProcId)' | xargs condor_ce_rm
Am I missing something obvious?
Cheers,
Stefano