HTCondor Project List Archives




Re: [Condor-devel] cpu-seconds in startd slot stats?



Hi Brian,

The number of cpus associated with a parent partitionable slot is not constant. That is the crux of the matter. If no jobs are running, it may have all the cpus on the machine. If a full load of jobs is running, it may have 0 cpus.

TotalTimeUnclaimedIdle for the partitionable slot as it is currently defined will just tell me how long the slot has been in the Unclaimed state. Since the parent partitionable slot is never claimed, this doesn't seem particularly interesting.

If we changed TotalTimeUnclaimedIdle to be cpu-time, then it would tell me something that I think is more useful: how much of the time cpus have been unclaimed.
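To make the difference concrete, here is a minimal sketch comparing the two definitions for a hypothetical 8-cpu partitionable slot. The function name, the timeline, and the numbers are invented for illustration; none of this is startd code.

```python
def unclaimed_totals(intervals):
    """Return (wall_seconds, cpu_seconds) for a list of
    (duration_seconds, unclaimed_cpus) pairs.

    wall_seconds is what the attribute reports today (time spent in the
    Unclaimed state); cpu_seconds is the proposed cpu-weighted value.
    """
    wall_seconds = sum(duration for duration, _ in intervals)
    cpu_seconds = sum(duration * cpus for duration, cpus in intervals)
    return wall_seconds, cpu_seconds


# Hypothetical timeline: the parent slot holds 8 free cpus for 100 s,
# then 2 for 200 s, then 0 for 50 s while the machine is fully loaded.
timeline = [(100, 8), (200, 2), (50, 0)]
wall, cpu = unclaimed_totals(timeline)
print(wall, cpu)  # 350 1200
```

The wall-clock figure counts the last 50 seconds even though no cpus were free, whereas the cpu-weighted figure does not.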

Now, one might also want to know how much of the time cpus have been in the Claimed state. My proposal doesn't solve that. Currently, there is no aggregation of stats from the child slots to the parent partitionable slot. So when a child slot becomes unclaimed and its resources are subsumed by the parent slot, its history is lost. That seems like something that should also be addressed. I think the child totals should be added to the parent.
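A rough sketch of what adding child totals to the parent could look like, assuming each slot keeps its per-state totals in a simple mapping. The class and method names are hypothetical, and the attribute names in the example are used only as illustrative keys.

```python
from collections import Counter


class PartitionableSlot:
    """Hypothetical parent slot that keeps running per-state totals."""

    def __init__(self):
        # Maps an attribute name to accumulated seconds.
        self.totals = Counter()

    def absorb_child(self, child_totals):
        """Fold a child slot's totals into the parent when the child's
        resources are returned, so its history is not lost."""
        self.totals.update(child_totals)  # Counter.update adds values


parent = PartitionableSlot()
parent.absorb_child({"TotalTimeClaimedBusy": 3600, "TotalTimeClaimedIdle": 30})
parent.absorb_child({"TotalTimeClaimedBusy": 400})
print(parent.totals["TotalTimeClaimedBusy"])  # 4000
```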

--Dan

p.s. I agree with you about the danger of changing semantics. There is also the creeping bloat of the ClassAd to worry about.

On 11/2/11 2:05 PM, Brian Bockelman wrote:
Hi Dan,

I don't particularly like this - anytime one changes the meaning of an attribute, someone pops up a few months later complaining they were using the original value.  Sometimes that someone is me :) [but not in this case]

What value does this add if SlotWeight is also available in the ad?

Brian

On Nov 2, 2011, at 1:58 PM, Dan Bradley wrote:

Currently, the startd advertises attributes such as TotalTimeUnclaimedIdle in each slot ad.  The unit is seconds.

For partitionable slots, I think this would be more useful if it were seconds times the number of cpus allocated to the slot.  For consistency, I propose that we make this change for all slots, not just partitionable slots.

One question is whether it should be cpu-seconds or SlotWeight-seconds.  Since SlotWeight defaults to number of cpus, it would be the same thing by default.  Changes to the number of cpus associated with a slot happen only in specific places in our code, whereas the value of SlotWeight could change at any time and so would require appropriate treatment.
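As a sketch of why the two choices differ in effort, consider an accumulator that integrates a weight over time. The class and its methods are hypothetical, not existing startd code. With cpus as the weight, set_weight() would be called from the few places where a slot's cpu count changes. With SlotWeight, which is an expression whose value could change at any time, the value would presumably have to be re-evaluated periodically, and the total would only be as accurate as the sampling interval.

```python
class WeightedSeconds:
    """Accumulates weight * elapsed_seconds between updates."""

    def __init__(self, start_time, weight):
        self.last_time = start_time
        self.weight = weight
        self.total = 0.0

    def advance(self, now):
        """Charge the time since the last update at the current weight."""
        self.total += (now - self.last_time) * self.weight
        self.last_time = now

    def set_weight(self, now, new_weight):
        """Close out the interval at the old weight, then switch."""
        self.advance(now)
        self.weight = new_weight


acc = WeightedSeconds(start_time=0, weight=8)  # 8 cpus unclaimed at t=0
acc.set_weight(100, 2)   # child slots carved off at t=100
acc.set_weight(300, 0)   # machine fully loaded at t=300
acc.advance(350)         # bring the total up to date before advertising
print(acc.total)  # 1200.0
```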

Thoughts?

--Dan
