
Re: [Condor-users] condor_rooster observation (bug possibly ?)



This seems to have fixed the problem.

thanks,

-ian.

> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-
> bounces@xxxxxxxxxxx] On Behalf Of Dan Bradley
> Sent: 13 April 2010 15:55
> To: Condor-Users Mail List
> Subject: Re: [Condor-users] condor_rooster observation (bug possibly ?)
> 
> Hi Ian,
> 
> To my dismay, I see that you are right about the negotiator matching too
> many offline machines when given a small number of jobs.  This is a
> result of a bug, and I expect to have it fixed in 7.4.3.
> 
> You also noted that when there are no more jobs, rooster is still waking
> up machines due to MachineLastMatchTime =!= undefined.  In your
> situation, I suggest requiring that the last match time be more recent
> than some cut-off value.  Example:
> 
> Unhibernate = CurrentTime - MachineLastMatchTime < 1200
> 
> You can tighten up the cutoff if your negotiator and rooster cycles are
> sufficiently fast.  The cut-off should be at least larger than the
> shorter of the rooster and negotiator cycle times.  A factor of 2 seems
> reasonable.
> 
> Let me know if that doesn't help.
> 
> --Dan
> 
> Smith, Ian wrote:
> > Dear All,
> >
> > I'm still beavering away with condor_rooster on our power-saving Windows
> > Condor pool and it seems to be going pretty well apart from a couple of
> > glitches.
> >
> > One thing I've noticed is that if I submit a relatively small number of jobs
> > then the negotiator sends a MERGE_STARTD_AD for >each offline machine<
> > to condor_rooster. Then condor_rooster attempts to wake up >all< of the
> > offline machines. Obviously for a largish pool this could easily result in
> > several hundred machines being woken up just to run a handful of jobs which
> > clearly isn't very energy efficient. I've worked around this by limiting the number
> > of machines woken up on each cycle and ensuring that the absolute limit
> > is equal to the number of idle jobs (clearly there is no point waking more
> > machines than there are jobs to run on them).  Also I've randomly shuffled
> > the matched offline machines so Condor doesn't repeatedly try to wake up
> > the same ones (some of which could be unreachable, powered off, kaput etc.).
> >
> > Is there a more elegant workaround to this or would it need changes to the
> > Condor code? I wonder if something could be included in the Unhibernate
> > expression which is set by default to:
> >
> > Unhibernate = MachineLastMatchTime =!= UNDEFINED
> >
> > On to the second point. I've noticed that if, for example, I have a large
> > number of jobs queued and I suddenly remove most/all of them then the
> > negotiator and condor_rooster don't seem to be aware of this and carry on
> > trying to wake up machines as before. If I invalidate and re-advertise the
> > offline machine ClassAds then everything sorts itself out. I can only assume
> > that the offline ClassAds are in some way "stale" and need to be updated
> > regularly. To get around this I re-advertise them every 10 minutes (the
> > condor_rooster cycle is also 10 minutes). Again I can't help wondering if
> > there is a better way of doing this.
> >
> > regards,
> >
> > -ian.
> >
> >
> > --------------------------------------------
> > Dr Ian C. Smith,
> > e-Science Team,
> > The University of Liverpool,
> > Computing Services Department
> >
> >
> > _______________________________________________
> > Condor-users mailing list
> > To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
> > subject: Unsubscribe
> > You can also unsubscribe by visiting
> > https://lists.cs.wisc.edu/mailman/listinfo/condor-users
> >
> > The archives can be found at:
> > https://lists.cs.wisc.edu/archive/condor-users/
> >
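
The cut-off suggested in Dan's reply above might look roughly like this in a
Condor configuration file. This is a sketch under assumptions, not a tested
setup: UNHIBERNATE_CUTOFF is a hypothetical helper macro that does not appear
in the thread, the explicit UNDEFINED guard is an addition, and the
ROOSTER_MAX_UNHIBERNATE value is purely illustrative. The 1200-second figure
is the one from Dan's example, which is twice the 10-minute rooster cycle Ian
mentions. Which machines each setting belongs on is also an assumption and
should be checked against the manual for the version in use.

```
# Sketch only; names marked "hypothetical" are not from the thread.

# Assumed to go on the execute machines, since the expression is
# published in each machine ad.
# Hypothetical helper macro: cut-off in seconds (value from Dan's example).
UNHIBERNATE_CUTOFF = 1200
# Wake a machine only if it was matched recently. The UNDEFINED guard is an
# addition so that never-matched machines evaluate cleanly to FALSE.
UNHIBERNATE = (MachineLastMatchTime =!= UNDEFINED) && \
              ((CurrentTime - MachineLastMatchTime) < $(UNHIBERNATE_CUTOFF))

# Assumed to go on the machine running condor_rooster.
# 10-minute rooster cycle, as in Ian's message.
ROOSTER_INTERVAL = 600
# Illustrative static cap on machines woken per cycle (0 means no limit).
ROOSTER_MAX_UNHIBERNATE = 10
```

Comments are kept on their own lines because trailing text after a value is
read as part of the value. Note that a fixed ROOSTER_MAX_UNHIBERNATE is only
a static upper bound; it does not reproduce Ian's workaround of capping
wake-ups at the current number of idle jobs.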