Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] Partitionable Slot Starvation

Date: Thu, 16 Aug 2012 17:28:41 -0500
From: Dan Bradley <dan@xxxxxxxxxxxx>
Subject: Re: [Condor-users] Partitionable Slot Starvation


On 8/16/12 2:46 PM, Dan Bradley wrote:


On 8/16/12 12:51 PM, William Strecker-Kellogg wrote:

Hi Dan,

On 08/16/2012 10:42 AM, Dan Bradley wrote:

If the problem was caused by DEFRAG_REQUIREMENTS and/or
DEFRAG_WHOLE_MACHINE_EXPR, the defrag log would indicate so with a
message like the following:

"Drained 0 machines (wanted to drain X machines)."

"Doing nothing, because DEFRAG_MAX_WHOLE_MACHINES=X and there are Y
whole machines."

Right, I'm not seeing that message.


As a sanity check, what numbers do you see in the following line in the
log when defrag starts up or is reconfigured?

"polling interval %ds, DEFRAG_DRAINING_MACHINES_PER_HOUR = %f/hour =
%d/interval + %d/hour + %d/day"


08/15/12 15:07:13 polling interval 90s,
DEFRAG_DRAINING_MACHINES_PER_HOUR = 12.000000/hour = 0/interval +
12/hour + 0

Based on that, I would expect defrag to attempt to drain 12 machines every hour. You should see a message in the logs something like this:


"Looking for 12 machines to drain."


I just found a bug in this code :(

The workaround is to set DEFRAG_INTERVAL large enough to avoid the hourly and daily draining rate corrections. If you set it back to the default 600 seconds, I would expect the scheduling to work correctly.


Sorry about that!

Here's the bug ticket:

https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=3199

--Dan

If your DEFRAG_INTERVAL were larger so that DEFRAG_DRAINING_MACHINES_PER_HOUR*DEFRAG_INTERVAL/3600 >= 1, then the draining attempts would be distributed more evenly throughout the hour rather than all landing at the beginning of the hour. For example, with the default DEFRAG_INTERVAL=600, it should attempt to drain 2 machines every 10 minutes.
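To make the arithmetic above concrete, here is a minimal Python sketch of how a target drain rate could be split into per-interval, per-hour, and per-day parts. The function name split_drain_rate and the rounding rules are assumptions chosen to reproduce the numbers quoted in this thread; they are not taken from the condor_defrag source.

```python
from fractions import Fraction
import math


def split_drain_rate(machines_per_hour, interval_seconds):
    """Split an hourly drain target into (per_interval, per_hour, per_day).

    Hypothetical helper: the decomposition is an assumption that matches
    the log line quoted in this thread, not the actual defrag code.
    """
    # Exact number of machines to drain in one polling interval.
    per_interval_exact = Fraction(machines_per_hour) * interval_seconds / 3600
    per_interval = math.floor(per_interval_exact)

    # Whatever the per-interval count cannot cover, expressed per hour.
    leftover_per_hour = (per_interval_exact - per_interval) * 3600 / interval_seconds
    per_hour = math.floor(leftover_per_hour)

    # Any remaining fraction is made up once per day.
    per_day = round((leftover_per_hour - per_hour) * 24)
    return per_interval, per_hour, per_day


# 12/hour with a 90 second interval: 12 * 90 / 3600 = 0.3 < 1,
# so nothing is drained per interval and all 12 land in the hourly part.
print(split_drain_rate(12, 90))   # (0, 12, 0)

# 12/hour with the default 600 second interval: 12 * 600 / 3600 = 2.
print(split_drain_rate(12, 600))  # (2, 0, 0)
```

Under these assumptions, the 90 second interval reproduces the "0/interval + 12/hour + 0" breakdown from the log, while 600 seconds gives 2 machines per interval with no hourly or daily correction.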
Anyway, if it is never attempting to drain, then that is unexpected. If it _is_ attempting to drain, then perhaps the attempt is being rejected for some reason. The log should contain an error message in this case. Do you see anything?
And what numbers do you see in the most recent log line of the following form:

"There are currently %d draining and %d whole machines."

08/16/12 12:09:31 There are currently 0 draining and 0 whole machines.
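For anyone who wants to pull these numbers out of a long log, here is a small Python sketch that returns the counts from the most recent matching line. The function name latest_defrag_counts is hypothetical, and the sketch works on log text passed in as a string because the location of the defrag log depends on the local configuration.

```python
import re

# Matches the status line quoted above and captures both counts.
PATTERN = re.compile(
    r"There are currently (\d+) draining and (\d+) whole machines\."
)


def latest_defrag_counts(log_text):
    """Return (draining, whole) from the last matching line, or None."""
    result = None
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match:
            # Later lines overwrite earlier ones, so the last match wins.
            result = (int(match.group(1)), int(match.group(2)))
    return result


sample = "08/16/12 12:09:31 There are currently 0 draining and 0 whole machines."
print(latest_defrag_counts(sample))  # (0, 0)
```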
One word of warning: defrag drains the whole startd, partitionable slots and static slots alike. If you only want it to drain some slots and not others, you need to run multiple startds and set DEFRAG_REQUIREMENTS to only match the slots of the startd to be drained and not the slots of the other startd.
OK, so do I infer that the defrag will only work on machines where there
is only one whole-machine slot? Or just that it will drain single-core
slots in addition to the partitionable ones?

The latter. When it chooses to drain a machine, it drains all slots on the machine, because that is all that is supported by the current draining operation in Condor.
--Dan

