[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [HTCondor-users] HTCondor-CE not purging finished jobs



Hi Stefano,

In addition to the time difference in the system periodic remove vs your queries, you're also comparing two different attributes against the current time: EnteredCurrentStatus and x509UserProxyExpiration, respectively. I'm curious what you see from the following query:

# condor_ce_q -cons '(JobStatus == 5 ) && (time() - EnteredCurrentStatus > 2 * 3600)'

Could you use that list of jobs for your analysis of the SchedLog?

Thanks,
Brian

On 5/18/20 4:27 PM, Stefano Dal Pra wrote:
Hi Brian,
sorry for the 8 hours vs 4 hours confusion. Jobs stay there much longer anyway.

I have set the debug level as you said (on a brand new CE working with  "ops" jobs only until now). I  also reduced the remove policy to 2 hours (to be sure there is something to purge). Before reconfiguring i selected jobid on hold for more than 16 hours and found  12 such jobs:

[root@ce01-lhcb-t2 ~]# condor_ce_q -cons '(JobStatus == 5 ) && (time() - x509UserProxyExpiration > 16 * 3600)' -- Schedd: ce01-lhcb-t2.cr.cnaf.infn.it : <131.154.192.120:28493> @ 05/18/20 22:27:32
OWNER  BATCH_NAME    SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS
ops046 ID: 66       5/16 21:35      _      _      _      1      1 66.0
ops046 ID: 67       5/16 23:35      _      _      _      1      1 67.0
ops046 ID: 68       5/17 01:35      _      _      _      1      1 68.0
ops046 ID: 69       5/17 03:35      _      _      _      1      1 69.0
ops046 ID: 70       5/17 05:35      _      _      _      1      1 70.0
ops046 ID: 71       5/17 07:35      _      _      _      1      1 71.0
ops046 ID: 72       5/17 09:35      _      _      _      1      1 72.0
ops046 ID: 73       5/17 11:35      _      _      _      1      1 73.0
ops046 ID: 74       5/17 13:35      _      _      _      1      1 74.0
ops046 ID: 75       5/17 15:35      _      _      _      1      1 75.0
ops046 ID: 76       5/17 17:35      _      _      _      1      1 76.0
ops046 ID: 77       5/17 19:35      _      _      _      1      1 77.0

After a condor_ce_reconfig  (i actually did a restart too) a few of them are gone, and a few are still there:

[root@ce01-lhcb-t2 ~]# condor_ce_q 66.0 67.0 68.0 69.0 70.0 71.0 72.0 73.0 74.0 75.0 76.0 77.0 -- Schedd: ce01-lhcb-t2.cr.cnaf.infn.it : <131.154.192.120:30149> @ 05/18/20 22:58:16
OWNER  BATCH_NAME    SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS
ops046 ID: 68       5/17 01:35      _      _      _      1      1 68.0
ops046 ID: 69       5/17 03:35      _      _      _      1      1 69.0
ops046 ID: 72       5/17 09:35      _      _      _      1      1 72.0
ops046 ID: 73       5/17 11:35      _      _      _      1      1 73.0
ops046 ID: 76       5/17 17:35      _      _      _      1      1 76.0
ops046 ID: 77       5/17 19:35      _      _      _      1      1 77.0

Total for query: 6 jobs; 0 completed, 0 removed, 0 idle, 0 running, 6 held, 0 suspended Total for all users: 15 jobs; 5 completed, 0 removed, 0 idle, 0 running, 10 held, 0 suspended

The SchedLog after reconfig has, for job 71.0 (this has been removed):

 05/18/20 22:33:05 (D_ALWAYS:2) abort_job_myself: 71.0 action:Remove log_hold:true
05/18/20 22:33:05 (D_ALWAYS:2) Cleared dirty attributes for job 71.0
05/18/20 22:33:05 (D_ALWAYS:2) Writing record to user logfile=/var/lib/condor-ce/spool/71/0/cluster71.proc0.subproc0/gridjob.log owner=ops046 05/18/20 22:33:05 (D_ALWAYS:2) WriteUserLog::initialize: opened /var/lib/condor-ce/spool/71/0/cluster71.proc0.subproc0/gridjob.log successfully
05/18/20 22:33:05 (D_ALWAYS:2) WriteUserLog::user_priv_flag (~) is 0
05/18/20 22:33:05 (D_ALWAYS) Job 71.0 aborted: CE job removed by SYSTEM_PERIODIC_REMOVE due to being in the hold state for 2 hours.

Looking for job 68.0 (not removed) however, there is nothing after reconfiguration time (22:30):

05/18/20 21:30:34 (D_ALWAYS:2) abort_job_myself: 68.0 action:Hold log_hold:true
05/18/20 21:30:34 (D_ALWAYS:2) Cleared dirty attributes for job 68.0
05/18/20 21:30:34 (D_ALWAYS:2) Writing record to user logfile=/var/lib/condor-ce/spool/68/0/cluster68.proc0.subproc0/gridjob.log owner=ops046 05/18/20 21:30:34 (D_ALWAYS:2) WriteUserLog::initialize: opened /var/lib/condor-ce/spool/68/0/cluster68.proc0.subproc0/gridjob.log successfully
05/18/20 21:30:34 (D_ALWAYS:2) WriteUserLog::user_priv_flag (~) is 0
05/18/20 21:30:34 (D_ALWAYS:2) SelfDrainingQueue act_on_job_myself_queue is empty, not resetting timer

And the job is still there:
[root@ce01-lhcb-t2 ~]# condor_ce_q 68.0


-- Schedd: ce01-lhcb-t2.cr.cnaf.infn.it : <131.154.192.120:30149> @ 05/18/20 23:23:49
OWNER  BATCH_NAME    SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS
ops046 ID: 68       5/17 01:35      _      _      _      1      1 68.0

Cheers
Stefano



Il 18/05/20 22:07, Brian Lin ha scritto:
Hi Stefano,

I'm a little confused, your system periodic remove expressions seem to remove jobs that have been held for more than 8 hours whereas your queries are looking for held jobs whose proxies have been expired for more than 4 hours. I imagine there's some overlap but they seem like fairly different queries.

Though having the RemoveReason set that ways is pretty strange. If you set "SCHEDD_DEBUG = D_CAT D_ALWAYS:2", you may see some hints in the SchedLog as to why the Schedd is failing to remove these jobs.

Thanks,
Brian

On 5/16/20 11:05 AM, Stefano Dal Pra wrote:
Hello,
htcondor-ce-3.4.0-1.el7.noarch here.

We have a problem common to all of our CEs:

[root@ce02-htc ~]# condor_ce_q -cons '(JobStatus == 5 ) && (time() - x509UserProxyExpiration > 4 * 3600)' -af Owner | sort | uniq -c
   9592 user1
      4 user2
   1114 user3
    575 user4
     44 user5

I have set up REMOVE  and REMOVE REASON rule:
SYSTEM_PERIODIC_REMOVE = (JobStatus == 5 && CurrentTime - EnteredCurrentStatus > 3600*8) SYSTEM_PERIODIC_REMOVE_REASON = strcat("CE job removed by SYSTEM_PERIODIC_REMOVE due to ", ifThenElse((JobStatus == 5 && CurrentTime - EnteredCurrentStatus > 3600*8), "being in the hold state for 8 hours.", ifThenElse((JobStatus == 5 && isUndefined(RoutedToJobId)), "non-existent route or entry in JOB_ROUTER_ENTRIES.", "input files missing." ) ) )

Inspecting these "non purged jobs", they have a RemoveReason set, but they are not gone nevertheless:

[root@ce02-htc ~]# condor_ce_q 1679707.0 -af JobStatus RemoveReason
5 CE job removed by SYSTEM_PERIODIC_REMOVE due to being in the hold state for 8 hours.

Until now i have no better way than removing these jobs manually using somethin like: condor_ce_q -cons '(JobStatus == 5 ) && (time() - x509UserProxyExpiration > 4 * 3600)' -af 'strcat(ClusterId,".",ProcId)' | xargs condor_ce_rm

Do i miss something obvious?
Cheers,
Stefano
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/