
Re: [Condor-users] condor_rooster observation (bug possibly ?)



Hi Ian,

To my dismay, I see that you are right about the negotiator matching too many offline machines when given a small number of jobs. This is a result of a bug, and I expect to have it fixed in 7.4.3.

You also noted that when there are no more jobs, rooster is still waking up machines due to MachineLastMatchTime =!= undefined. In your situation, I suggest requiring that the last match time be more recent than some cut-off value. Example:

Unhibernate = CurrentTime - MachineLastMatchTime < 1200

You can tighten the cut-off if your negotiator and rooster cycles are sufficiently fast. It should at least exceed the shorter of the rooster and negotiator cycle times. A factor of 2 seems reasonable.
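
To illustrate how the cut-off behaves, here is a rough Python sketch of the same test. It is not Condor code: the function names are made up for illustration, the times are plain integer seconds, and it assumes that a machine with no MachineLastMatchTime at all is simply not woken.

```python
from typing import Optional


def should_unhibernate(current_time: int,
                       last_match_time: Optional[int],
                       cutoff: int = 1200) -> bool:
    """Mimic: Unhibernate = CurrentTime - MachineLastMatchTime < cutoff.

    last_match_time is None when the attribute is undefined
    (assumption: such a machine is never woken).
    """
    if last_match_time is None:
        return False
    return current_time - last_match_time < cutoff


def suggested_cutoff(rooster_cycle: int, negotiator_cycle: int,
                     factor: int = 2) -> int:
    """Hypothetical helper: 'factor' times the shorter of the two cycles."""
    return factor * min(rooster_cycle, negotiator_cycle)
```

With a 10 minute (600 second) rooster cycle and a negotiator cycle no shorter than that, suggested_cutoff gives 1200, the value used in the example above. A machine last matched 1000 seconds ago would still be woken, and one last matched 2000 seconds ago would not.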

Let me know if that doesn't help.

--Dan

Smith, Ian wrote:
Dear All,

I'm still beavering away with condor_rooster on our power-saving Windows
Condor pool and it seems to be going pretty well apart from a couple of
glitches.

One thing I've noticed is that if I submit a relatively small number of jobs
then the negotiator sends a MERGE_STARTD_AD for >each offline machine<
to condor_rooster. Then condor_rooster attempts to wake up >all< of the
offline machines. Obviously for a largish pool this could easily result in
several hundred machines being woken up just to run a handful of jobs which
clearly isn't very energy efficient. I've worked around this by limiting the number
of machines woken up on each cycle and ensuring that the absolute limit
is equal to the number of idle jobs (clearly there is no point waking more
machines than there are jobs to run on them). Also I've randomly shuffled
the matched offline machines so Condor doesn't repeatedly try to wake up
the same ones (some of which could be unreachable, powered off, kaput etc.).
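
The workaround described above could be sketched roughly as follows. This is only an illustration in Python, not the actual change made to condor_rooster: the function name, its parameters and the machine names are hypothetical, and it assumes the list of matched offline machines and the idle job count have already been obtained by some other means.

```python
import random
from typing import List, Optional, Sequence


def pick_machines_to_wake(matched_offline: Sequence[str],
                          idle_jobs: int,
                          per_cycle_limit: int,
                          rng: Optional[random.Random] = None) -> List[str]:
    """Choose which matched offline machines to wake in this cycle.

    The candidates are shuffled so the same (possibly unreachable)
    machines are not tried first every time, and the result is capped
    at the smaller of the per-cycle limit and the number of idle jobs.
    """
    rng = rng or random.Random()
    candidates = list(matched_offline)  # copy, leave the input untouched
    rng.shuffle(candidates)
    limit = max(0, min(per_cycle_limit, idle_jobs))
    return candidates[:limit]
```

For example, with 300 matched offline machines, 5 idle jobs and a per-cycle limit of 20, at most 5 machines would be selected for waking.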

Is there a more elegant workaround to this or would it need changes to the
Condor code? I wonder if something could be included in the Unhibernate
expression which is set by default to:

Unhibernate = MachineLastMatchTime =!= UNDEFINED

On to the second point. I've noticed that if, for example, I have a large number
of jobs queued and I suddenly remove most/all of them, then the negotiator
and condor_rooster don't seem to be aware of this and carry on trying
to wake up machines as before. If I invalidate and re-advertise the offline
machine ClassAds then everything sorts itself out. I can only assume
that the offline ClassAds are in some way "stale" and need to be updated
regularly. To get around this I re-advertise them every 10 minutes (the
condor_rooster cycle is also 10 minutes). Again I can't help wondering if there
is a better way of doing this.

regards,

-ian.


--------------------------------------------
Dr Ian C. Smith,
e-Science Team,
The University of Liverpool,
Computing Services Department

