Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] condor_rooster observation (bug possibly ?)

Date: Tue, 13 Apr 2010 09:54:46 -0500
From: Dan Bradley <dan@xxxxxxxxxxxx>
Subject: Re: [Condor-users] condor_rooster observation (bug possibly ?)

Hi Ian,

To my dismay, I see that you are right about the negotiator matching too many offline machines when given a small number of jobs. This is a result of a bug, and I expect to have it fixed in 7.4.3.

You also noted that when there are no more jobs, rooster is still waking up machines due to MachineLastMatchTime =!= undefined. In your situation, I suggest requiring that the last match time be more recent than some cut-off value. Example:


Unhibernate = CurrentTime - MachineLastMatchTime < 1200

You can tighten up the cut-off if your negotiator and rooster cycles are sufficiently fast. The cut-off should be at least larger than the shorter of the rooster and negotiator cycle times. A factor of 2 seems reasonable.
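
The rule of thumb above can be sketched in a few lines of Python. This is only an illustration, not Condor code: the function names and the example cycle times are hypothetical, the factor of 2 is read here as multiplying the shorter cycle time, and treating a missing MachineLastMatchTime as "do not wake" is an assumption about how the expression would be evaluated.

```python
def unhibernate_cutoff(rooster_interval_s, negotiator_interval_s, factor=2):
    """Cut-off in seconds: `factor` times the shorter of the two cycle times.

    Hypothetical helper; the factor of 2 follows the suggestion in the text.
    """
    return factor * min(rooster_interval_s, negotiator_interval_s)


def should_unhibernate(current_time, machine_last_match_time, cutoff_s):
    """Python mirror of: CurrentTime - MachineLastMatchTime < cutoff."""
    if machine_last_match_time is None:
        # Assumption: an UNDEFINED last match time never triggers a wake-up.
        return False
    return current_time - machine_last_match_time < cutoff_s


# Illustrative values only: a 10-minute rooster cycle and a 15-minute
# negotiator cycle give a cut-off of 2 * 600 = 1200 seconds.
cutoff = unhibernate_cutoff(600, 900)
print(cutoff)
print(should_unhibernate(current_time=5000, machine_last_match_time=4500, cutoff_s=cutoff))
```

With a 10-minute rooster cycle this reproduces the 1200 used in the example expression; a machine last matched more than 1200 seconds ago would no longer be woken.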


Let me know if that doesn't help.

--Dan

Smith, Ian wrote:

Dear All,

I'm still beavering away with condor_rooster on our power-saving Windows
Condor pool and it seems to be going pretty well apart from a couple of
glitches.

One thing I've noticed is that if I submit a relatively small number of jobs
then the negotiator sends a MERGE_STARTD_AD for >each offline machine<
to condor_rooster. Then condor_rooster attempts to wake up >all< of the
offline machines. Obviously for a largish pool this could easily result in
several hundred machines being woken up just to run a handful of jobs which
clearly isn't very energy efficient. I've worked around this by limiting the number
of machines woken up on each cycle and ensuring that the absolute limit
is equal to the number of idle jobs (clearly there is no point waking more
machines than there are jobs to run on them). Also I've randomly shuffled
the matched offline machines so condor doesn't repeatedly try to wake up
the same ones (some of which could be unreachable, powered off, kaput etc etc).
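
A minimal Python sketch of the workaround described above, for illustration only. It is not the code actually used on the pool: the function name, its parameters and the machine names are hypothetical, and it assumes the list of matched offline machines and the idle job count have already been obtained by some other means.

```python
import random


def pick_machines_to_wake(matched_offline, idle_jobs, per_cycle_limit, rng=None):
    """Choose which matched offline machines to wake on this cycle.

    Never picks more machines than there are idle jobs or than the
    per-cycle limit, and shuffles first so that the same (possibly
    unreachable) machines are not retried on every cycle.
    """
    rng = rng or random.Random()
    limit = max(0, min(idle_jobs, per_cycle_limit, len(matched_offline)))
    candidates = list(matched_offline)  # copy so the caller's list is untouched
    rng.shuffle(candidates)
    return candidates[:limit]


# Hypothetical example: five matched machines, three idle jobs, limit of ten.
machines = ["pc01", "pc02", "pc03", "pc04", "pc05"]
print(pick_machines_to_wake(machines, idle_jobs=3, per_cycle_limit=10))
```

Shuffling before truncating means a machine that fails to wake has no more chance than any other of being picked again on the next cycle.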

Is there a more elegant workaround to this, or would it need changes to the
Condor code? I wonder if something could be included in the Unhibernate
expression, which is set by default to:

Unhibernate = MachineLastMatchTime =!= UNDEFINED

On to the second point. I've noticed that if, for example, I have a large number
of jobs queued and I suddenly remove most/all of them, then the negotiator
and condor_rooster don't seem to be aware of this and carry on trying
to wake up machines as before. If I invalidate and re-advertise the offline
machine ClassAds then everything sorts itself out. I can only assume
that the offline ClassAds are in some way "stale" and need to be updated
regularly. To get around this I re-advertise them every 10 minutes (the condor_rooster
cycle is also 10 minutes). Again I can't help wondering if there is a better
way of doing this.

regards,

-ian.


--------------------------------------------
Dr Ian C. Smith,
e-Science Team,
The University of Liverpool,
Computing Services Department


