
Re: [HTCondor-users] HTCondor-CE not purging finished jobs



Hi Stefano,

Thanks for dropping by office hours! As we discussed there, I suspect the problem is this: the SYSTEM_PERIODIC_REMOVE configuration in your local HTCondor config triggers on the routed jobs and removes the sandbox shared between the CE job and the routed job; when the CE's own SYSTEM_PERIODIC_REMOVE later kicks in, it can't remove the job because it can't write to the job log. To solve this problem, try:

1. Remove the SYSTEM_PERIODIC_REMOVE configuration from your local condor config on the CE.
2. Set "SYSTEM_PERIODIC_HOLD = ( $(SYSTEM_PERIODIC_HOLD) || <EXPR> )" in your HTCondor-CE configuration, replacing <EXPR> with the expression that you removed from your local condor config in step 1.

This way you follow the CE convention of holding misbehaving jobs, leaving them around for the submitter if necessary, then removing them via SYSTEM_PERIODIC_REMOVE after a grace period.
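Concretely, the combined CE-side configuration might look something like the sketch below (the file name is hypothetical, and `<EXPR>` stands for whatever expression you currently have in your local condor config):

```
# /etc/condor-ce/config.d/99-local-policy.conf  (hypothetical file name)
# Hold misbehaving jobs on the CE instead of removing them outright;
# the stock CE SYSTEM_PERIODIC_REMOVE then cleans up held jobs after its grace period.
SYSTEM_PERIODIC_HOLD = ( $(SYSTEM_PERIODIC_HOLD) || <EXPR> )
```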

- Brian

On 5/20/20 11:25 AM, Brian Lin wrote:
Hi Stefano,

Thanks for running those queries! If EnteredCurrentStatus is constantly updating, the jobs must be cycling between job states frequently. Could you pick one of these held jobs and run `grep <JOB ID> /var/lib/condor-ce/spool/job_queue.log` as root?

Also, could you send me the entirety of your SchedLog and `condor_ce_config_val -dump` off list?

Thanks,
Brian

On 5/19/20 1:25 AM, Stefano Dal Pra wrote:
Hello Brian,
In my last mail I was considering a 16-hour time limit. Comparing the two attributes with that limit:

# Querying on x509UserProxyExpiration with the 16-hour limit finds 8 entries:

[root@ce01-lhcb-t2 ~]# condor_ce_q -cons '(JobStatus == 5 ) && (time() - x509UserProxyExpiration > 16 * 3600)'
-- Schedd: ce01-lhcb-t2.cr.cnaf.infn.it : <131.154.192.120:30149> @ 05/19/20 08:02:10
OWNER  BATCH_NAME    SUBMITTED   DONE   RUN    IDLE   HOLD TOTAL JOB_IDS
ops046 ID: 68       5/17 01:35      _      _      _      1 1 68.0
ops046 ID: 69       5/17 03:35      _      _      _      1 1 69.0
ops046 ID: 72       5/17 09:35      _      _      _      1 1 72.0
ops046 ID: 73       5/17 11:35      _      _      _      1 1 73.0
ops046 ID: 76       5/17 17:35      _      _      _      1 1 76.0
ops046 ID: 77       5/17 19:35      _      _      _      1 1 77.0
ops046 ID: 80       5/18 01:35      _      _      _      1 1 80.0
ops046 ID: 81       5/18 03:35      _      _      _      1 1 81.0

Total for query: 8 jobs; 0 completed, 0 removed, 0 idle, 0 running, 8 held, 0 suspended
Total for all users: 20 jobs; 6 completed, 0 removed, 0 idle, 0 running, 14 held, 0 suspended

# Querying on EnteredCurrentStatus with the same limit finds none:

[root@ce01-lhcb-t2 ~]# condor_ce_q -cons '(JobStatus == 5 ) && (time() - EnteredCurrentStatus > 16 * 3600)'
-- Schedd: ce01-lhcb-t2.cr.cnaf.infn.it : <131.154.192.120:30149> @ 05/19/20 08:02:30
OWNER BATCH_NAME      SUBMITTED   DONE   RUN    IDLE   HOLD TOTAL JOB_IDS

Total for query: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for all users: 20 jobs; 6 completed, 0 removed, 0 idle, 0 running, 14 held, 0 suspended

It seems to me that EnteredCurrentStatus in this case is updated regularly and never becomes old enough:

[root@ce01-lhcb-t2 ~]# date
mar 19 mag 2020, 08.20.54, CEST
[root@ce01-lhcb-t2 ~]# condor_ce_q 68 69 72 73 76 77 80 81 -af 'abstime(EnteredCurrentStatus)'
absTime("2020-05-19T05:32:24+02:00")
absTime("2020-05-19T05:32:24+02:00")
absTime("2020-05-19T05:30:24+02:00")
absTime("2020-05-19T05:30:24+02:00")
absTime("2020-05-19T05:27:23+02:00")
absTime("2020-05-19T05:27:23+02:00")
absTime("2020-05-19T05:26:22+02:00")
absTime("2020-05-19T05:26:22+02:00")
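The effect can be sketched in a few lines of Python (a toy model of the -cons expression, not HTCondor code; the 3-hour refresh interval is a hypothetical stand-in for whatever keeps touching these jobs): if EnteredCurrentStatus is reset more often than the limit, the condition can never fire.

```python
# Toy model: a held job whose EnteredCurrentStatus keeps being refreshed.
LIMIT = 16 * 3600    # the 16-hour threshold used in the query
REFRESH = 3 * 3600   # hypothetical: something resets the timestamp every 3 hours

def would_match(now, entered_current_status, limit=LIMIT):
    """Mimic the constraint: time() - EnteredCurrentStatus > limit."""
    return now - entered_current_status > limit

# Simulate 48 hours of wall-clock time, sampling once a minute.
entered = 0
matched = False
for now in range(0, 48 * 3600, 60):
    if now % REFRESH == 0:
        entered = now  # state churn resets the timestamp
    matched = matched or would_match(now, entered)

print(matched)  # False: the job's "age" never exceeds the refresh interval
```

So the query on EnteredCurrentStatus would come back empty even for jobs that have genuinely been stuck for days, which matches what you observed.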

Cheers,
Stefano


On 18/05/20 23:49, Brian Lin wrote:
Hi Stefano,

In addition to the time difference in the system periodic remove vs your queries, you're also comparing two different attributes against the current time: EnteredCurrentStatus and x509UserProxyExpiration, respectively. I'm curious what you see from the following query:

# condor_ce_q -cons '(JobStatus == 5 ) && (time() - EnteredCurrentStatus > 2 * 3600)'

Could you use that list of jobs for your analysis of the SchedLog?

Thanks,
Brian

On 5/18/20 4:27 PM, Stefano Dal Pra wrote:
Hi Brian,
sorry for the 8-hours-vs-4-hours confusion. The jobs stay around much longer than that anyway.

I have set the debug level as you suggested (on a brand-new CE that has only run "ops" jobs so far). I also reduced the remove policy to 2 hours (to be sure there is something to purge). Before reconfiguring, I selected the job IDs on hold for more than 16 hours and found 12 such jobs:

[root@ce01-lhcb-t2 ~]# condor_ce_q -cons '(JobStatus == 5 ) && (time() - x509UserProxyExpiration > 16 * 3600)'
-- Schedd: ce01-lhcb-t2.cr.cnaf.infn.it : <131.154.192.120:28493> @ 05/18/20 22:27:32
OWNER  BATCH_NAME    SUBMITTED   DONE   RUN    IDLE   HOLD TOTAL JOB_IDS
ops046 ID: 66       5/16 21:35      _      _      _      1 1 66.0
ops046 ID: 67       5/16 23:35      _      _      _      1 1 67.0
ops046 ID: 68       5/17 01:35      _      _      _      1 1 68.0
ops046 ID: 69       5/17 03:35      _      _      _      1 1 69.0
ops046 ID: 70       5/17 05:35      _      _      _      1 1 70.0
ops046 ID: 71       5/17 07:35      _      _      _      1 1 71.0
ops046 ID: 72       5/17 09:35      _      _      _      1 1 72.0
ops046 ID: 73       5/17 11:35      _      _      _      1 1 73.0
ops046 ID: 74       5/17 13:35      _      _      _      1 1 74.0
ops046 ID: 75       5/17 15:35      _      _      _      1 1 75.0
ops046 ID: 76       5/17 17:35      _      _      _      1 1 76.0
ops046 ID: 77       5/17 19:35      _      _      _      1 1 77.0

After a condor_ce_reconfig (I actually did a restart too), a few of them are gone and a few are still there:

[root@ce01-lhcb-t2 ~]# condor_ce_q 66.0 67.0 68.0 69.0 70.0 71.0 72.0 73.0 74.0 75.0 76.0 77.0
-- Schedd: ce01-lhcb-t2.cr.cnaf.infn.it : <131.154.192.120:30149> @ 05/18/20 22:58:16
OWNER  BATCH_NAME    SUBMITTED   DONE   RUN    IDLE   HOLD TOTAL JOB_IDS
ops046 ID: 68       5/17 01:35      _      _      _      1 1 68.0
ops046 ID: 69       5/17 03:35      _      _      _      1 1 69.0
ops046 ID: 72       5/17 09:35      _      _      _      1 1 72.0
ops046 ID: 73       5/17 11:35      _      _      _      1 1 73.0
ops046 ID: 76       5/17 17:35      _      _      _      1 1 76.0
ops046 ID: 77       5/17 19:35      _      _      _      1 1 77.0

Total for query: 6 jobs; 0 completed, 0 removed, 0 idle, 0 running, 6 held, 0 suspended
Total for all users: 15 jobs; 5 completed, 0 removed, 0 idle, 0 running, 10 held, 0 suspended

After the reconfig, the SchedLog shows the following for job 71.0 (which was removed):

05/18/20 22:33:05 (D_ALWAYS:2) abort_job_myself: 71.0 action:Remove log_hold:true
05/18/20 22:33:05 (D_ALWAYS:2) Cleared dirty attributes for job 71.0
05/18/20 22:33:05 (D_ALWAYS:2) Writing record to user logfile=/var/lib/condor-ce/spool/71/0/cluster71.proc0.subproc0/gridjob.log owner=ops046
05/18/20 22:33:05 (D_ALWAYS:2) WriteUserLog::initialize: opened /var/lib/condor-ce/spool/71/0/cluster71.proc0.subproc0/gridjob.log successfully
05/18/20 22:33:05 (D_ALWAYS:2) WriteUserLog::user_priv_flag (~) is 0
05/18/20 22:33:05 (D_ALWAYS) Job 71.0 aborted: CE job removed by SYSTEM_PERIODIC_REMOVE due to being in the hold state for 2 hours.

Looking at job 68.0 (not removed), however, there is nothing after the reconfiguration time (22:30):

05/18/20 21:30:34 (D_ALWAYS:2) abort_job_myself: 68.0 action:Hold log_hold:true
05/18/20 21:30:34 (D_ALWAYS:2) Cleared dirty attributes for job 68.0
05/18/20 21:30:34 (D_ALWAYS:2) Writing record to user logfile=/var/lib/condor-ce/spool/68/0/cluster68.proc0.subproc0/gridjob.log owner=ops046
05/18/20 21:30:34 (D_ALWAYS:2) WriteUserLog::initialize: opened /var/lib/condor-ce/spool/68/0/cluster68.proc0.subproc0/gridjob.log successfully
05/18/20 21:30:34 (D_ALWAYS:2) WriteUserLog::user_priv_flag (~) is 0
05/18/20 21:30:34 (D_ALWAYS:2) SelfDrainingQueue act_on_job_myself_queue is empty, not resetting timer

And the job is still there:
[root@ce01-lhcb-t2 ~]# condor_ce_q 68.0


-- Schedd: ce01-lhcb-t2.cr.cnaf.infn.it : <131.154.192.120:30149> @ 05/18/20 23:23:49
OWNER  BATCH_NAME    SUBMITTED   DONE   RUN    IDLE   HOLD TOTAL JOB_IDS
ops046 ID: 68       5/17 01:35      _      _      _      1 1 68.0

Cheers
Stefano



On 18/05/20 22:07, Brian Lin wrote:
Hi Stefano,

I'm a little confused: your SYSTEM_PERIODIC_REMOVE expression removes jobs that have been held for more than 8 hours, whereas your queries look for held jobs whose proxies expired more than 4 hours ago. I imagine there's some overlap, but they are fairly different queries.

Still, having the RemoveReason set that way is pretty strange. If you set "SCHEDD_DEBUG = D_CAT D_ALWAYS:2", you may see some hints in the SchedLog as to why the schedd is failing to remove these jobs.

Thanks,
Brian

On 5/16/20 11:05 AM, Stefano Dal Pra wrote:
Hello,
htcondor-ce-3.4.0-1.el7.noarch here.

We have a problem common to all of our CEs:

[root@ce02-htc ~]# condor_ce_q -cons '(JobStatus == 5 ) && (time() - x509UserProxyExpiration > 4 * 3600)' -af Owner | sort | uniq -c
   9592 user1
      4 user2
   1114 user3
    575 user4
     44 user5

I have set up SYSTEM_PERIODIC_REMOVE and SYSTEM_PERIODIC_REMOVE_REASON rules:
SYSTEM_PERIODIC_REMOVE = (JobStatus == 5 && CurrentTime - EnteredCurrentStatus > 3600*8)
SYSTEM_PERIODIC_REMOVE_REASON = strcat("CE job removed by SYSTEM_PERIODIC_REMOVE due to ", \
  ifThenElse((JobStatus == 5 && CurrentTime - EnteredCurrentStatus > 3600*8), \
    "being in the hold state for 8 hours.", \
    ifThenElse((JobStatus == 5 && isUndefined(RoutedToJobId)), \
      "non-existent route or entry in JOB_ROUTER_ENTRIES.", \
      "input files missing." ) ) )
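The nested ifThenElse in that REASON expression can be paraphrased in Python (a sketch of the ClassAd logic, not the actual evaluator; the function name is made up for illustration):

```python
import time

def remove_reason(job_status, entered_current_status, routed_to_job_id, now=None):
    """Paraphrase of the SYSTEM_PERIODIC_REMOVE_REASON ClassAd expression.
    routed_to_job_id=None stands for an undefined RoutedToJobId."""
    now = time.time() if now is None else now
    prefix = "CE job removed by SYSTEM_PERIODIC_REMOVE due to "
    if job_status == 5 and now - entered_current_status > 3600 * 8:
        return prefix + "being in the hold state for 8 hours."
    if job_status == 5 and routed_to_job_id is None:
        return prefix + "non-existent route or entry in JOB_ROUTER_ENTRIES."
    return prefix + "input files missing."

# A job held for 9 hours matches the first branch:
print(remove_reason(5, 0, "123.0", now=9 * 3600))
# -> CE job removed by SYSTEM_PERIODIC_REMOVE due to being in the hold state for 8 hours.
```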

Inspecting these non-purged jobs shows they have a RemoveReason set, yet they are still in the queue:

[root@ce02-htc ~]# condor_ce_q 1679707.0 -af JobStatus RemoveReason
5 CE job removed by SYSTEM_PERIODIC_REMOVE due to being in the hold state for 8 hours.

Until now I have no better way than removing these jobs manually, using something like:

condor_ce_q -cons '(JobStatus == 5 ) && (time() - x509UserProxyExpiration > 4 * 3600)' -af 'strcat(ClusterId,".",ProcId)' | xargs condor_ce_rm

Am I missing something obvious?
Cheers,
Stefano
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/




