Re: [Condor-users] offline compute nodes and Rooster

On 10/15/10 8:12 AM, Paul Haldane wrote:
Me again - I just need to check my understanding of how power management and Rooster should work.

This is 7.4.3 on a Linux central collector and 7.4.2 on Windows 7 compute nodes.

The behaviour I'm seeing is that compute nodes aren't being powered up to service queued jobs. I submitted a batch yesterday evening (after all the machines had gone to sleep). There was nothing in RoosterLog to indicate that the system requested WoL of any workers. condor_status didn't show any nodes (though entries were written to offline.log for the hibernating machines). Some more experimentation shows that compute nodes appear in condor_status for a while (10-20 minutes) after the machines hibernate.

Can I just check my understanding of what should be happening ...

1. Condor on the compute nodes sends ads to the collector when Offline becomes true (idle and not claimed for at least a minute).  This information is stored in offline.log.  This bit is working as I expect.

Are you sure this part is working? I'm worried that if Condor tries and fails to hibernate the machine, it may send an ad that is not an "Offline" ad, and this will remove the Offline ad from the persistent store.

When ads are removed from the persistent store, you should see a line in offline.log beginning with 102, which is the DestroyClassAd command.
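One way to check for this is to scan offline.log for records whose first field is 102. The following is a minimal sketch, not a Condor tool: it assumes only that each record is one line and that the operation code is the first whitespace-separated field, as described above. The function name is made up for illustration.

```python
from typing import Iterable, List

# Operation code for DestroyClassAd, as stated in the reply above.
DESTROY_CLASSAD_OP = "102"


def find_destroy_records(lines: Iterable[str]) -> List[str]:
    """Return the log lines whose first field is exactly 102.

    Assumption: one record per line, with the opcode as the first
    whitespace-separated field. Nothing else about the format is assumed.
    """
    hits = []
    for line in lines:
        fields = line.split()
        # Compare the whole field so that e.g. "1024" is not a match.
        if fields and fields[0] == DESTROY_CLASSAD_OP:
            hits.append(line.rstrip("\n"))
    return hits
```

To use it, open your offline.log (the path is site-specific) and pass the file object to find_destroy_records(). Any returned record would suggest that an Offline ad was dropped from the persistent store.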

2. If Condor were in control of power, Condor on the compute node would then put the machine to sleep. We don't use that functionality; instead some other process does the hibernation (with logic to not hibernate if Condor is running a job).

3.  Offline slots _should_ (I think they should, but would like confirmation) continue to appear in the output of condor_status (using -constraint Offline to just see offline slots).  In our environment they only appear for 10-20 minutes after powering off.  This isn't what I expect because OFFLINE_EXPIRE_ADS_AFTER defaults to maxint.

Yes, the offline ads should remain visible in condor_status. They should not expire after 10-20 minutes if you are using the default OFFLINE_EXPIRE_ADS_AFTER.

4. Hibernating compute nodes should be woken up by Rooster on the collector - it will only wake nodes which are visible in condor_status (again that's what I think - am I right?).  This doesn't work for us because the offline nodes are only visible for a short time after the node hibernates.

Yes, you are right. If you are using the default unhibernate expression, what should happen is that some job will get matched to one of the offline ads and this will result in MachineLastMatchTime getting updated. Once that happens, the machine's Unhibernate expression should become true, which should cause Rooster to try to wake it up.
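As a rough illustration of that sequence, here is a toy model of the wake-up decision. It is a sketch under an assumption: that the default Unhibernate expression amounts to "MachineLastMatchTime is defined" and that Rooster only considers ads with Offline set to true. Check the expressions in your own configuration; the function name and the sample timestamp are made up.

```python
from typing import Any, Mapping


def rooster_would_wake(ad: Mapping[str, Any]) -> bool:
    """Toy model: wake a machine if its ad is Offline and has been matched.

    Assumption (not taken from the Condor source): the wake-up test is
    equivalent to Offline being true and MachineLastMatchTime being defined.
    """
    is_offline = ad.get("Offline") is True
    was_matched = ad.get("MachineLastMatchTime") is not None
    return is_offline and was_matched


# An offline ad that no job has matched yet: nothing to wake.
unmatched = {"Name": "slot1@node1", "Offline": True}

# The same ad after the negotiator matched a job to it.
matched = dict(unmatched, MachineLastMatchTime=1287150000)
```

In this model rooster_would_wake(unmatched) is False and rooster_would_wake(matched) is True. If the offline ad has already disappeared from the collector, there is nothing to match against, so the second state is never reached.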

What I've observed here is that if Condor decides that it needs a node's resources in the short time between the node becoming Offline and disappearing from condor_status, then it will try to wake it (often it's already awake).

(a) is my mental model right?  If not please point me at the right docs (I might be just missing something obvious - just like the last problem I was having).

(b) Is the step that's missing in our environment the hibernation under Condor's control?  Do the Condor daemons at that point send a message to the collector saying "please remember me while I'm asleep"?

Yes. Directly before hibernation, condor_startd sends an Offline ad to the collector, which is basically "please remember me while I'm asleep". Any ClassAd sent by the startd that is not an offline ad will remove the persistent Offline ad, on the assumption that the machine has now woken back up. This is unfortunate, because it does not support external hibernation very well.

It is possible to generate offline ads with condor_advertise (by setting Offline=True), so you could generate the offline ads that way, once you observe that the machine has gone away. This is what Ian Smith has done using an external script that he wrote.
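A minimal sketch of that approach follows. This is not Ian Smith's script. The helper names are made up, and the four attributes shown are only an assumed minimum: a usable offline ad almost certainly needs more (for example the attributes jobs match against and whatever Rooster needs to send the wake-up packet), so in practice you would start from a copy of the machine's last real ad and set Offline=True in it.

```python
import subprocess
import tempfile
from typing import List


def build_offline_ad(machine: str, slot_name: str) -> str:
    """Return ClassAd text for a bare-bones offline machine ad.

    Assumption: these attributes are illustrative only. A real ad should
    be based on the machine's last advertised ad.
    """
    attrs = [
        'MyType = "Machine"',
        f'Name = "{slot_name}"',
        f'Machine = "{machine}"',
        "Offline = True",
    ]
    return "\n".join(attrs) + "\n"


def advertise_command(ad_path: str) -> List[str]:
    """Build the condor_advertise call that publishes a startd ad file."""
    return ["condor_advertise", "UPDATE_STARTD_AD", ad_path]


def advertise_offline(machine: str, slot_name: str) -> None:
    """Write the ad to a temporary file and hand it to condor_advertise."""
    with tempfile.NamedTemporaryFile("w", suffix=".ad", delete=False) as f:
        f.write(build_offline_ad(machine, slot_name))
        ad_path = f.name
    # Raises CalledProcessError if condor_advertise exits non-zero.
    subprocess.run(advertise_command(ad_path), check=True)
```

An external hibernation script could call advertise_offline() once it has confirmed that the machine has actually gone to sleep, so that the ad is not immediately replaced by an ordinary startd update.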