
Re: [HTCondor-users] HTCondor-CE not purging finished jobs



Hi Stefano,

Thanks for dropping by office hours! As we discussed there, I suspect the problem is this: the SYSTEM_PERIODIC_REMOVE configuration in your local HTCondor config triggers on the routed jobs and removes the sandbox shared between the CE job and the routed job; when the CE's own SYSTEM_PERIODIC_REMOVE later kicks in, it can't remove the job because it can't write to the job log. To solve this problem, try:

1. Remove the SYSTEM_PERIODIC_REMOVE configuration from your local condor config on the CE.
2. Set "SYSTEM_PERIODIC_HOLD = ( $(SYSTEM_PERIODIC_HOLD) || <EXPR> )" in your HTCondor-CE configuration, replacing <EXPR> with the expression that you removed from your local condor config in step 1.

This way you follow the CE convention of holding misbehaving jobs, leaving them around for the submitter if necessary, then removing them via SYSTEM_PERIODIC_REMOVE after a grace period.
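Concretely, the combined CE-side configuration might look something like the sketch below (the file name is hypothetical, and `<EXPR>` stands for whatever expression you currently have in your local condor config):

```
# /etc/condor-ce/config.d/99-local-policy.conf  (hypothetical file name)
# Hold misbehaving jobs on the CE instead of removing them outright;
# the stock CE SYSTEM_PERIODIC_REMOVE then cleans up held jobs after its grace period.
SYSTEM_PERIODIC_HOLD = ( $(SYSTEM_PERIODIC_HOLD) || <EXPR> )
```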

- Brian

On 5/20/20 11:25 AM, Brian Lin wrote:
Hi Stefano,

Thanks for running those queries! If EnteredCurrentStatus is constantly updating, the jobs must be cycling between job states frequently. Could you pick one of these held jobs and run `grep <JOB ID> /var/lib/condor-ce/spool/job_queue.log` as root?

Also, could you send me the entirety of your SchedLog and `condor_ce_config_val -dump` off list?

Thanks,
Brian

On 5/19/20 1:25 AM, Stefano Dal Pra wrote:
Hello Brian,
In my last mail I was considering a 16-hour time limit. Comparing the two attributes with that limit:

# Querying on x509UserProxyExpiration with the 16-hour limit finds 8 entries:

[root@ce01-lhcb-t2 ~]# condor_ce_q -cons '(JobStatus == 5 ) && (time() - x509UserProxyExpiration > 16 * 3600)'
-- Schedd: ce01-lhcb-t2.cr.cnaf.infn.it : <131.154.192.120:30149> @ 05/19/20 08:02:10
OWNER  BATCH_NAME    SUBMITTED   DONE   RUN    IDLE   HOLD TOTAL JOB_IDS
ops046 ID: 68       5/17 01:35      _      _      _      1 1 68.0
ops046 ID: 69       5/17 03:35      _      _      _      1 1 69.0
ops046 ID: 72       5/17 09:35      _      _      _      1 1 72.0
ops046 ID: 73       5/17 11:35      _      _      _      1 1 73.0
ops046 ID: 76       5/17 17:35      _      _      _      1 1 76.0
ops046 ID: 77       5/17 19:35      _      _      _      1 1 77.0
ops046 ID: 80       5/18 01:35      _      _      _      1 1 80.0
ops046 ID: 81       5/18 03:35      _      _      _      1 1 81.0

Total for query: 8 jobs; 0 completed, 0 removed, 0 idle, 0 running, 8 held, 0 suspended
Total for all users: 20 jobs; 6 completed, 0 removed, 0 idle, 0 running, 14 held, 0 suspended

# Querying on EnteredCurrentStatus with the same limit finds none:

[root@ce01-lhcb-t2 ~]# condor_ce_q -cons '(JobStatus == 5 ) && (time() - EnteredCurrentStatus > 16 * 3600)'
-- Schedd: ce01-lhcb-t2.cr.cnaf.infn.it : <131.154.192.120:30149> @ 05/19/20 08:02:30
OWNER BATCH_NAME      SUBMITTED   DONE   RUN    IDLE   HOLD TOTAL JOB_IDS

Total for query: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for all users: 20 jobs; 6 completed, 0 removed, 0 idle, 0 running, 14 held, 0 suspended

It seems to me that EnteredCurrentStatus in this case is updated regularly and never becomes old enough:

[root@ce01-lhcb-t2 ~]# date
mar 19 mag 2020, 08.20.54, CEST
[root@ce01-lhcb-t2 ~]# condor_ce_q 68 69 72 73 76 77 80 81 -af 'abstime(EnteredCurrentStatus)'
absTime("2020-05-19T05:32:24+02:00")
absTime("2020-05-19T05:32:24+02:00")
absTime("2020-05-19T05:30:24+02:00")
absTime("2020-05-19T05:30:24+02:00")
absTime("2020-05-19T05:27:23+02:00")
absTime("2020-05-19T05:27:23+02:00")
absTime("2020-05-19T05:26:22+02:00")
absTime("2020-05-19T05:26:22+02:00")
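The effect can be sketched in a few lines of Python (a toy model of the -cons expression, not HTCondor code; the 3-hour refresh interval is a hypothetical stand-in for whatever keeps touching these jobs): if EnteredCurrentStatus is reset more often than the limit, the condition can never fire.

```python
# Toy model: a held job whose EnteredCurrentStatus keeps being refreshed.
LIMIT = 16 * 3600    # the 16-hour threshold used in the query
REFRESH = 3 * 3600   # hypothetical: something resets the timestamp every 3 hours

def would_match(now, entered_current_status, limit=LIMIT):
    """Mimic the constraint: time() - EnteredCurrentStatus > limit."""
    return now - entered_current_status > limit

# Simulate 48 hours of wall-clock time, sampling once a minute.
entered = 0
matched = False
for now in range(0, 48 * 3600, 60):
    if now % REFRESH == 0:
        entered = now  # state churn resets the timestamp
    matched = matched or would_match(now, entered)

print(matched)  # False: the job's "age" never exceeds the refresh interval
```

So the query on EnteredCurrentStatus would come back empty even for jobs that have genuinely been stuck for days, which matches what you observed.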

Cheers,
Stefano


On 18/05/20 23:49, Brian Lin wrote:
Hi Stefano,

In addition to the time difference in the system periodic remove vs your queries, you're also comparing two different attributes against the current time: EnteredCurrentStatus and x509UserProxyExpiration, respectively. I'm curious what you see from the following query:

# condor_ce_q -cons '(JobStatus == 5 ) && (time() - EnteredCurrentStatus > 2 * 3600)'

Could you use that list of jobs for your analysis of the SchedLog?

Thanks,
Brian

On 5/18/20 4:27 PM, Stefano Dal Pra wrote:
Hi Brian,
sorry for the 8-hours-vs-4-hours confusion. The jobs stay around much longer than that anyway.

I have set the debug level as you suggested (on a brand-new CE that has only run "ops" jobs so far). I also reduced the remove policy to 2 hours (to be sure there is something to purge). Before reconfiguring, I selected the job IDs on hold for more than 16 hours and found 12 such jobs:

[root@ce01-lhcb-t2 ~]# condor_ce_q -cons '(JobStatus == 5 ) && (time() - x509UserProxyExpiration > 16 * 3600)'
-- Schedd: ce01-lhcb-t2.cr.cnaf.infn.it : <131.154.192.120:28493> @ 05/18/20 22:27:32
OWNER  BATCH_NAME    SUBMITTED   DONE   RUN    IDLE   HOLD TOTAL JOB_IDS
ops046 ID: 66       5/16 21:35      _      _      _      1 1 66.0
ops046 ID: 67       5/16 23:35      _      _      _      1 1 67.0
ops046 ID: 68       5/17 01:35      _      _      _      1 1 68.0
ops046 ID: 69       5/17 03:35      _      _      _      1 1 69.0
ops046 ID: 70       5/17 05:35      _      _      _      1 1 70.0
ops046 ID: 71       5/17 07:35      _      _      _      1 1 71.0
ops046 ID: 72       5/17 09:35      _      _      _      1 1 72.0
ops046 ID: 73       5/17 11:35      _      _      _      1 1 73.0
ops046 ID: 74       5/17 13:35      _      _      _      1 1 74.0
ops046 ID: 75       5/17 15:35      _      _      _      1 1 75.0
ops046 ID: 76       5/17 17:35      _      _      _      1 1 76.0
ops046 ID: 77       5/17 19:35      _      _      _      1 1 77.0

After a condor_ce_reconfig (I actually did a restart too), a few of them are gone and a few are still there:

[root@ce01-lhcb-t2 ~]# condor_ce_q 66.0 67.0 68.0 69.0 70.0 71.0 72.0 73.0 74.0 75.0 76.0 77.0
-- Schedd: ce01-lhcb-t2.cr.cnaf.infn.it : <131.154.192.120:30149> @ 05/18/20 22:58:16
OWNER  BATCH_NAME    SUBMITTED   DONE   RUN    IDLE   HOLD TOTAL JOB_IDS
ops046 ID: 68       5/17 01:35      _      _      _      1 1 68.0
ops046 ID: 69       5/17 03:35      _      _      _      1 1 69.0
ops046 ID: 72       5/17 09:35      _      _      _      1 1 72.0
ops046 ID: 73       5/17 11:35      _      _      _      1 1 73.0
ops046 ID: 76       5/17 17:35      _      _      _      1 1 76.0
ops046 ID: 77       5/17 19:35      _      _      _      1 1 77.0

Total for query: 6 jobs; 0 completed, 0 removed, 0 idle, 0 running, 6 held, 0 suspended
Total for all users: 15 jobs; 5 completed, 0 removed, 0 idle, 0 running, 10 held, 0 suspended

After the reconfig, the SchedLog shows the following for job 71.0 (which was removed):

05/18/20 22:33:05 (D_ALWAYS:2) abort_job_myself: 71.0 action:Remove log_hold:true
05/18/20 22:33:05 (D_ALWAYS:2) Cleared dirty attributes for job 71.0
05/18/20 22:33:05 (D_ALWAYS:2) Writing record to user logfile=/var/lib/condor-ce/spool/71/0/cluster71.proc0.subproc0/gridjob.log owner=ops046
05/18/20 22:33:05 (D_ALWAYS:2) WriteUserLog::initialize: opened /var/lib/condor-ce/spool/71/0/cluster71.proc0.subproc0/gridjob.log successfully
05/18/20 22:33:05 (D_ALWAYS:2) WriteUserLog::user_priv_flag (~) is 0
05/18/20 22:33:05 (D_ALWAYS) Job 71.0 aborted: CE job removed by SYSTEM_PERIODIC_REMOVE due to being in the hold state for 2 hours.

Looking at job 68.0 (not removed), however, there is nothing after the reconfiguration time (22:30):

05/18/20 21:30:34 (D_ALWAYS:2) abort_job_myself: 68.0 action:Hold log_hold:true
05/18/20 21:30:34 (D_ALWAYS:2) Cleared dirty attributes for job 68.0
05/18/20 21:30:34 (D_ALWAYS:2) Writing record to user logfile=/var/lib/condor-ce/spool/68/0/cluster68.proc0.subproc0/gridjob.log owner=ops046
05/18/20 21:30:34 (D_ALWAYS:2) WriteUserLog::initialize: opened /var/lib/condor-ce/spool/68/0/cluster68.proc0.subproc0/gridjob.log successfully
05/18/20 21:30:34 (D_ALWAYS:2) WriteUserLog::user_priv_flag (~) is 0
05/18/20 21:30:34 (D_ALWAYS:2) SelfDrainingQueue act_on_job_myself_queue is empty, not resetting timer

And the job is still there:
[root@ce01-lhcb-t2 ~]# condor_ce_q 68.0


-- Schedd: ce01-lhcb-t2.cr.cnaf.infn.it : <131.154.192.120:30149> @ 05/18/20 23:23:49
OWNER  BATCH_NAME    SUBMITTED   DONE   RUN    IDLE   HOLD TOTAL JOB_IDS
ops046 ID: 68       5/17 01:35      _      _      _      1 1 68.0

Cheers
Stefano



On 18/05/20 22:07, Brian Lin wrote:
Hi Stefano,

I'm a little confused: your SYSTEM_PERIODIC_REMOVE expression removes jobs that have been held for more than 8 hours, whereas your queries look for held jobs whose proxies expired more than 4 hours ago. I imagine there's some overlap, but they are fairly different queries.

Still, having the RemoveReason set that way is pretty strange. If you set "SCHEDD_DEBUG = D_CAT D_ALWAYS:2", you may see some hints in the SchedLog as to why the schedd is failing to remove these jobs.

Thanks,
Brian

On 5/16/20 11:05 AM, Stefano Dal Pra wrote:
Hello,
htcondor-ce-3.4.0-1.el7.noarch here.

We have a problem common to all of our CEs:

[root@ce02-htc ~]# condor_ce_q -cons '(JobStatus == 5 ) && (time() - x509UserProxyExpiration > 4 * 3600)' -af Owner | sort | uniq -c
   9592 user1
      4 user2
   1114 user3
    575 user4
     44 user5

I have set up SYSTEM_PERIODIC_REMOVE and SYSTEM_PERIODIC_REMOVE_REASON rules:
SYSTEM_PERIODIC_REMOVE = (JobStatus == 5 && CurrentTime - EnteredCurrentStatus > 3600*8)
SYSTEM_PERIODIC_REMOVE_REASON = strcat("CE job removed by SYSTEM_PERIODIC_REMOVE due to ", \
  ifThenElse((JobStatus == 5 && CurrentTime - EnteredCurrentStatus > 3600*8), \
    "being in the hold state for 8 hours.", \
    ifThenElse((JobStatus == 5 && isUndefined(RoutedToJobId)), \
      "non-existent route or entry in JOB_ROUTER_ENTRIES.", \
      "input files missing." ) ) )
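The nested ifThenElse in that REASON expression can be paraphrased in Python (a sketch of the ClassAd logic, not the actual evaluator; the function name is made up for illustration):

```python
import time

def remove_reason(job_status, entered_current_status, routed_to_job_id, now=None):
    """Paraphrase of the SYSTEM_PERIODIC_REMOVE_REASON ClassAd expression.
    routed_to_job_id=None stands for an undefined RoutedToJobId."""
    now = time.time() if now is None else now
    prefix = "CE job removed by SYSTEM_PERIODIC_REMOVE due to "
    if job_status == 5 and now - entered_current_status > 3600 * 8:
        return prefix + "being in the hold state for 8 hours."
    if job_status == 5 and routed_to_job_id is None:
        return prefix + "non-existent route or entry in JOB_ROUTER_ENTRIES."
    return prefix + "input files missing."

# A job held for 9 hours matches the first branch:
print(remove_reason(5, 0, "123.0", now=9 * 3600))
# -> CE job removed by SYSTEM_PERIODIC_REMOVE due to being in the hold state for 8 hours.
```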

Inspecting these non-purged jobs shows they have a RemoveReason set, yet they are still in the queue:

[root@ce02-htc ~]# condor_ce_q 1679707.0 -af JobStatus RemoveReason
5 CE job removed by SYSTEM_PERIODIC_REMOVE due to being in the hold state for 8 hours.

Until now I have no better way than removing these jobs manually, using something like:

condor_ce_q -cons '(JobStatus == 5 ) && (time() - x509UserProxyExpiration > 4 * 3600)' -af 'strcat(ClusterId,".",ProcId)' | xargs condor_ce_rm

Am I missing something obvious?
Cheers,
Stefano
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/




