
Re: [HTCondor-users] Question: Missing Startd Statistics for Slot



The JobDuration and JobBusyTime counters should still work.

They are global counters, however, not per slot counters. 

So there is no reason to configure STARTD_SLOT_ATTRS in this way. The counters are for the whole startd; there are NO per-slot counters, so using STARTD_SLOT_ATTRS in this way will just create a whole lot of copies of the same values.
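
In configuration terms, that amounts to keeping the STATISTICS_TO_PUBLISH_LIST line from the original message and dropping the STARTD_SLOT_ATTRS line. A minimal sketch, assuming nothing else in the glidein configuration depends on the slot<N>_ copies:

```
# Publish the JobDuration and JobBusyTime probes
# (this line is unchanged from the original message).
STARTD.STATISTICS_TO_PUBLISH_LIST = $(STATISTICS_TO_PUBLISH_LIST) JobDuration, JobBusyTime

# STARTD_SLOT_ATTRS is intentionally not set here: the counters are
# startd-wide, so copying them into every slot's ad under a slot<N>_
# prefix would only duplicate the same values.
```

The result can then be checked with the same condor_status -l query used in the original message, looking for the unprefixed attribute names (RecentJobBusyTimeAvg, RecentJobBusyTimeCount) rather than the slot<N>_ ones.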

-tj


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Namratha V. Urs via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Monday, February 26, 2024 1:45 PM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Cc: Namratha V. Urs <nurs@xxxxxxxx>
Subject: [HTCondor-users] Question: Missing Startd Statistics for Slot
 

Hi there,

 

I am a developer on the GlideinWMS project and we are currently looking into implementing a blackhole detection mechanism for glideins. There had been some conversation/discussion about this back in 2018 and I have been referring to those notes that were made available internally within my team since I’ve been working on enabling this feature in GlideinWMS. All the details I describe next are based off of that. 

 

We have the following lines in our condor configuration:

STARTD.STATISTICS_TO_PUBLISH_LIST = $(STATISTICS_TO_PUBLISH_LIST) JobDuration, JobBusyTime

STARTD_SLOT_ATTRS = RecentJobBusyTimeAvg, RecentJobBusyTimeCount

 

The notes seemed to convey that there are 16 attributes generated in each slot because of two statistics probes (JobDuration, JobBusyTime). While these attributes are not published by default (due to their number), their publishing can be enabled by adding the first line in the code snippet to the configuration of the execute nodes. Having said that, as per my understanding, using STARTD_SLOT_ATTRS should enable two attributes per slot -- slot<N>_RecentJobBusyTimeAvg and slot<N>_RecentJobBusyTimeCount, depending on the type of slot (fixed vs. partitionable). However, I do not see these two attributes in the ClassAd when I query it on the client side using the command: `condor_status -l <slot1@glidein> | grep -i "job"`.

 

I wanted to reach out to understand if I’m missing something and/or learn if things have changed in HTCondor since 2018 (which is when the initial discussion about the blackhole mechanism took place between GlideinWMS and HTCondor teams). If you need further information about anything that I’ve described above, please let me know and I’ll be happy to share.

 

Looking forward to your reply.

 

 

 

Thanks,

 

Namratha Urs (she/her)

Software Developer, Scientific Compute Services and Tools

Computational Science and AI Directorate, Fermi National Accelerator Laboratory

Ph.D. Candidate, Computer Science | University of North Texas