[Condor-users] condor_rooster observation (bug possibly ?)

Dear All,

I'm still beavering away with condor_rooster on our power-saving Windows
Condor pool and it seems to be going pretty well apart from a couple of points.

One thing I've noticed is that if I submit a relatively small number of jobs
then the negotiator sends a MERGE_STARTD_AD for >each offline machine<
to condor_rooster, and condor_rooster then attempts to wake up >all< of the
offline machines. For a largish pool this could easily result in
several hundred machines being woken up just to run a handful of jobs, which
clearly isn't very energy efficient. I've worked around this by limiting the number
of machines woken up on each cycle and ensuring that the absolute limit
is equal to the number of idle jobs (there is no point waking more
machines than there are jobs to run on them). I've also randomly shuffled
the matched offline machines so that Condor doesn't repeatedly try to wake up
the same ones (some of which could be unreachable, powered off, kaput etc.).
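
To make the workaround concrete, here is a minimal sketch of the selection
logic only. It is not condor_rooster code: the function name, its parameters
and the machine names are all hypothetical, and it assumes the list of matched
offline machines and the idle job count have already been obtained elsewhere.

```python
import random


def select_machines_to_wake(offline_matched, idle_jobs, per_cycle_limit, rng=None):
    """Choose which matched offline machines to wake on this cycle.

    All names here are hypothetical; this only illustrates the capping
    and shuffling described above.
    """
    rng = rng or random.Random()
    # Never wake more machines than there are idle jobs, than the
    # per-cycle limit allows, or than are actually available.
    limit = max(0, min(idle_jobs, per_cycle_limit, len(offline_matched)))
    # Shuffle a copy so the same (possibly unreachable) machines are not
    # always tried first.
    candidates = list(offline_matched)
    rng.shuffle(candidates)
    return candidates[:limit]


if __name__ == "__main__":
    pool = [f"node{i:03d}" for i in range(200)]  # made-up machine names
    print(select_machines_to_wake(pool, idle_jobs=5, per_cycle_limit=20))
```

With 200 matched offline machines and 5 idle jobs this wakes at most 5
machines, and a different random subset is likely to be tried on each cycle.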

Is there a more elegant workaround for this, or would it need changes to the
Condor code? I wonder if something could be included in the Unhibernate
expression, which is set by default to:

Unhibernate = MachineLastMatchTime =!= UNDEFINED

On to the second point. I've noticed that if, for example, I have a large number
of jobs queued and I suddenly remove most or all of them, then the negotiator
and condor_rooster don't seem to be aware of this and carry on trying
to wake up machines as before. If I invalidate and re-advertise the offline
machine ClassAds then everything sorts itself out. I can only assume
that the offline ClassAds are in some way "stale" and need to be updated
regularly. To get around this I re-advertise them every 10 minutes (the condor_rooster
cycle is also 10 minutes). Again, I can't help wondering if there is a better
way of doing this.



Dr Ian C. Smith,
e-Science Team,
The University of Liverpool,
Computing Services Department