
Re: [Condor-users] offline compute nodes and Rooster

> From: Paul Haldane
> Sent: 16 October 2010 14:24
> > > > 3.  Offline slots _should_ (I think they should, but would like
> > > confirmation) continue to appear in the output of condor_status (using
> > > -constraint Offline to just see offline slots).  In our environment
> > > they only appear for 10/20 minutes after powering off.  This isn't what
> > > I expect because OFFLINE_EXPIRE_ADS_AFTER defaults to maxint.
> > >
> > > Yes, the offline ads should remain visible in condor_status.  They
> > > should not expire in 30 minutes if you are using the default
> I've just been able to grab (using condor_status -l
> yard10.campus.ncl.ac.uk) the ADS for a machine that's unpingable (so it
> is hibernating) but still visible in condor_status output.
> I won't include all 109 lines of output here (unless that would be
> useful - full version is at
> http://www.staff.ncl.ac.uk/paul.haldane/yard10.txt).  All looks
> plausible to me apart from
> Offline = ((CurrentTime - EnteredCurrentState) >= 60 &&
>         MachineLastMatchTime =?= UNDEFINED && State =?= "Unclaimed")
> Is that correct or should it just be a simple Boolean value?
> I know why it's showing that value ("Offline = $(ShouldHibernate)" in
> the config file on the compute nodes) but perfectly willing to believe
> that it's rubbish.

I've made progress on a couple of fronts.

1. Realised that we'd changed ROOSTER_UNHIBERNATE to a daft setting.

We had

 ROOSTER_UNHIBERNATE = Unhibernate && Offline =?= False

... which I don't think would ever match.  Changing it to the default value of 

 ROOSTER_UNHIBERNATE = Unhibernate && Offline == True

... worked better but because I don't think we're setting Unhibernate properly yet I've currently got 


2. Hacked together a script using condor_advertise to publish ads for offline machines.  This works and, with the sensible setting for ROOSTER_UNHIBERNATE, leads to hibernating machines being woken up by Rooster to service jobs.  The remaining problem was that the ads disappeared after about 20 minutes.  A bit more poking around took me back to Ian's message to the list (https://lists.cs.wisc.edu/archive/condor-users/2010-January/msg00148.shtml).  Adding ClassAdLifetime to the published ad seems to have done the trick (at least the test machine has stayed visible for over 25 minutes).
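
For illustration, here is a minimal Python sketch of the approach in item 2. It is not the script mentioned above. It assumes you start from a saved copy of a machine's ad (for example the output of condor_status -l), with one attribute per line, and that forcing Offline to a plain True is what you want. The function names make_offline_ad and advertise are made up for this example, and the lifetime assumes ClassAdLifetime is measured in seconds.

```python
import subprocess
import tempfile

TWO_DAYS = 2 * 24 * 60 * 60  # 2 days expressed in seconds (172800)


def make_offline_ad(ad_text, lifetime=TWO_DAYS):
    """Return ad_text with Offline and ClassAdLifetime forced to fixed values.

    Assumption: one "Name = value" attribute per line, as condor_status -l
    prints them. Any existing Offline or ClassAdLifetime line is dropped.
    """
    overrides = {"Offline": "True", "ClassAdLifetime": str(lifetime)}
    kept = []
    for line in ad_text.splitlines():
        # The attribute name is everything before the first "=".
        name = line.split("=", 1)[0].strip()
        if line.strip() and name not in overrides:
            kept.append(line.rstrip())
    kept.extend(f"{name} = {value}" for name, value in overrides.items())
    return "\n".join(kept) + "\n"


def advertise(ad_text):
    """Publish the ad with condor_advertise (requires the Condor tools on PATH).

    Assumption: the ad is a machine ad, so UPDATE_STARTD_AD is the command.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".ad") as fh:
        fh.write(ad_text)
        fh.flush()
        subprocess.run(
            ["condor_advertise", "UPDATE_STARTD_AD", fh.name], check=True
        )


if __name__ == "__main__":
    # Tiny made-up ad fragment; a real ad has many more attributes.
    sample = (
        'Name = "yard10.campus.ncl.ac.uk"\n'
        "Offline = ((CurrentTime - EnteredCurrentState) >= 60)\n"
    )
    print(make_offline_ad(sample), end="")
```

The text rewriting is kept separate from the condor_advertise call so that the ad can be inspected before anything is sent to the collector.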

I think this shows me what I need to do to make this work without the external script - I just need to set ClassAdLifetime on the compute nodes to something useful (currently using 2 days).  Once we've got that, I think the only remaining issue is our power management system sometimes not noticing Condor activity (but that's definitely a local problem).