
Re: [Condor-users] offline compute nodes and Rooster



> From: Dan Bradley
> Sent: 15 October 2010 16:58
> 
> On 10/15/10 8:12 AM, Paul Haldane wrote:
> > Me again - I just need to check my understanding of how power
> > management and Rooster should work.
> >
> > This is 7.4.3 on a Linux central collector and 7.4.2 on Windows 7
> > compute nodes.
> >
> > Behaviour I'm seeing is that compute nodes aren't being powered up to
> > service queued jobs. I submitted a batch yesterday evening (after all
> > the machines had gone to sleep).  Nothing in RoosterLog to indicate that
> > the system requested WoL of any workers.  condor_status didn't show
> > any nodes (though entries were written to offline.log for the
> > hibernating machines).  Doing some more experimentation shows that
> > compute nodes appear in condor_status for a while (10-20 minutes) after
> > the machines hibernate.
> >
> > Can I just check my understanding of what should be happening ...
> >
> > 1. Condor on the compute nodes sends ADs to collector when Offline
> > becomes true (idle and not claimed for at least a minute).  This
> > information is stored in offline.log.  This bit is working as I expect.
> 
> Are you sure this part is working?  I'm worried that if Condor tries and
> fails to hibernate the machine that it may send an ad that is not an
> "Offline" ad and this will remove the Offline ad from the persistent
> store.
> 
> When ads are removed from the persistent store, you should see a line
> in offline.log beginning with 102, which is the DestroyClassAd command.

What I see is ...

105 
102 <YARD19.campus.ncl.ac.uk,10.15.0.55>
101 <YARD19.campus.ncl.ac.uk,10.15.0.55> Machine Job
103 <YARD19.campus.ncl.ac.uk,10.15.0.55> Name "YARD19.campus.ncl.ac.uk"
103 <YARD19.campus.ncl.ac.uk,10.15.0.55> Rank 0.000000
103 <YARD19.campus.ncl.ac.uk,10.15.0.55> CpuBusy ((LoadAvg - CondorLoadAvg) >= 0.500000)
103 <YARD19.campus.ncl.ac.uk,10.15.0.55> SlotWeight Cpus
103 <YARD19.campus.ncl.ac.uk,10.15.0.55> Unhibernate MY.MachineLastMatchTime =!= UNDEFINED
103 <YARD19.campus.ncl.ac.uk,10.15.0.55> MyCurrentTime 1287199971
103 <YARD19.campus.ncl.ac.uk,10.15.0.55> Machine "YARD19.campus.ncl.ac.uk"
...
103 <YARD19.campus.ncl.ac.uk,10.15.0.55> Offline ((CurrentTime - EnteredCurrentState) >= 60 && MachineLastMatchTime =?= UNDEFINED && State =?= "Unclaimed")
...
103 <YARD19.campus.ncl.ac.uk,10.15.0.55> TimeToLive 2147483647
...
<lots more "103" lines>
...
103 <YARD19.campus.ncl.ac.uk,10.15.0.55> UpdatesHistory "0x00000000000000000000000000000000"
106

That's timed at 04:32 (GMT) shortly before yard19 hibernated.  Is that "forget everything you already know about me and use this info instead"?  Is that what we should be getting?  And if it is, why might these nodes not be persisting?  

Just in case it makes a difference we're using Quill on the master.

Looking just now I see a couple of entries of the form

105 
102 <YARD10.campus.ncl.ac.uk,10.15.0.73>
106

These are for machines that have just been woken up by people sitting at them.  That makes sense - "forget everything you know (and use the live data)".
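
To check which ads actually survive in the persistent store, the log can be replayed offline. The following is a minimal sketch, not Condor's own reader. It relies on 102 being DestroyClassAd (as Dan says above) and assumes that 101 creates an ad, 103 sets one attribute, and 105/106 bracket a transaction that was always committed. The function name replay_offline_log is made up for illustration.

```python
def replay_offline_log(lines):
    """Replay offline.log records and return the ads left in the store.

    Assumed record meanings (only 102 is confirmed in this thread):
      101 <key> MyType TargetType   create an empty ad
      102 <key>                     destroy the ad
      103 <key> Attr Value          set one attribute
      105 / 106                     begin / end transaction (ignored here)
    """
    ads = {}
    for raw in lines:
        parts = raw.strip().split(None, 3)
        # Skip blank lines, transaction markers, "..." and wrapped fragments.
        if len(parts) < 2 or parts[0] not in ("101", "102", "103"):
            continue
        op, key = parts[0], parts[1]
        if op == "101":
            ads[key] = {}
        elif op == "102":
            ads.pop(key, None)
        elif op == "103" and len(parts) == 4:
            ads.setdefault(key, {})[parts[2]] = parts[3]
    return ads


if __name__ == "__main__":
    import sys

    # Usage: python replay.py /path/to/offline.log
    with open(sys.argv[1]) as log:
        for key, attrs in sorted(replay_offline_log(log).items()):
            print(key, "Offline =", attrs.get("Offline", "<not set>"))
```

A machine that was destroyed and then re-created inside the same transaction (the YARD19 case) ends up in the result. One that was only destroyed (the YARD10 case) does not.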

> > 2. If condor was in control of power Condor on the compute node would
> > then put itself to sleep. We don't use that functionality; instead some
> > other process does the hibernation (with logic to not hibernate if
> > Condor is running a job).
> >
> > 3.  Offline slots _should_ (I think they should, but would like
> > confirmation) continue to appear in the output of condor_status (using
> > -constraint Offline to just see offline slots).  In our environment
> > they only appear for 10/20 minutes after powering off.  This isn't what
> > I expect because OFFLINE_EXPIRE_ADS_AFTER defaults to maxint.
> 
> Yes, the offline ads should remain visible in condor_status.  They
> should not expire in 30 minutes if you are using the default
> OFFLINE_EXPIRE_ADS_AFTER.
> 
> > 4. Hibernating compute nodes should be woken up by Rooster on the
> > collector - it will only wake nodes which are visible in condor_status
> > (again that's what I think - am I right?).  This doesn't work for us
> > because the offline nodes are only visible for a short time after node
> > hibernates.
> 
> Yes, you are right.  If you are using the default unhibernate
> expression, what should happen is that some job will get matched to one
> of the offline ads and this will result in MachineLastMatchTime getting
> updated.  Once that happens, the machine's Unhibernate expression
> should become true, which should cause Rooster to try to wake it up. 
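
The wake-up condition described above can be modelled in a few lines. This is only an illustrative sketch, not Rooster's code. It assumes the ad is a plain dict, that a missing key stands for UNDEFINED, and that Rooster wakes a machine when both Offline and Unhibernate hold. The function names are hypothetical.

```python
def unhibernate(ad):
    # Mirrors the expression in the log: MY.MachineLastMatchTime =!= UNDEFINED
    return ad.get("MachineLastMatchTime") is not None


def rooster_should_wake(ad):
    # Assumed wake condition: the ad is an Offline ad and Unhibernate is true.
    return bool(ad.get("Offline")) and unhibernate(ad)


# An offline ad that no job has matched yet: nothing to do.
sleeping = {"Name": "YARD19.campus.ncl.ac.uk", "Offline": True}
print(rooster_should_wake(sleeping))

# After the negotiator matches a job, MachineLastMatchTime is set.
matched = dict(sleeping, MachineLastMatchTime=1287199971)
print(rooster_should_wake(matched))
```

If the Offline ad has already dropped out of the collector there is nothing for a job to match against, so MachineLastMatchTime is never set and no wake-up is attempted.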
> 
> > What I've observed here is that if Condor decides that it needs a
> > node's resource in the short time between it becoming Offline and
> > disappearing from condor_status then it will try to wake it (often it's
> > already awake).
> >
> > (a) is my mental model right?  If not please point me at the right
> > docs (I might be just missing something obvious - just like the last
> > problem I was having).
> >
> > (b) Is the step that's missing in our environment the hibernation
> > under condor's control.  Do the condor daemons at that point send a
> > message to collector saying "please remember me while I'm asleep"?
> 
> Yes.  Directly before hibernation, condor_startd sends an Offline ad to
> the collector, which is basically "please remember me while I'm
> asleep".  Any ClassAd sent by the startd that is not an offline ad will
> remove the persistent Offline ad, on the assumption that the machine
> has now woken back up.  This is unfortunate, because it doesn't support
> external hibernation very well.
> 
> It is possible to generate offline ads with condor_advertise (by setting
> Offline=True), so you could generate the offline ads that way, once you
> observe that the machine has gone away.  This is what Ian Smith has
> done using an external script that he wrote.

I'd come across Ian's paper during my research - it sounds like we can apply similar techniques using condor_advertise if necessary.
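
As a rough sketch of what that might look like (this is not Ian's script): take a machine ad saved while the node was still up, for example the output of condor_status -l for that slot, force Offline to True, and hand the result to condor_advertise with the UPDATE_STARTD_AD command. The function names and the idea of reusing a saved ad are assumptions for illustration, and a real offline ad may need other attributes adjusted as well.

```python
import subprocess
import tempfile


def make_offline_ad(ad_text):
    """Return ad_text with any existing Offline attribute replaced by Offline = True."""
    kept = []
    for line in ad_text.splitlines():
        if not line.strip():
            continue
        name = line.split("=", 1)[0].strip().lower()
        if name == "offline":
            continue  # drop the old value; a fresh one is appended below
        kept.append(line)
    kept.append("Offline = True")
    return "\n".join(kept) + "\n"


def advertise_offline(ad_text, pool=None):
    """Write the offline ad to a temporary file and send it to the collector."""
    with tempfile.NamedTemporaryFile("w", suffix=".ad", delete=False) as ad_file:
        ad_file.write(make_offline_ad(ad_text))
        path = ad_file.name
    cmd = ["condor_advertise"]
    if pool:
        cmd += ["-pool", pool]
    cmd += ["UPDATE_STARTD_AD", path]
    subprocess.run(cmd, check=True)
```

The external hibernation process would call advertise_offline() once it has seen the machine go away, so the collector ends up holding an ad with Offline set.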

Paul