Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] offline compute nodes and Rooster

Date: Fri, 15 Oct 2010 10:57:40 -0500
From: Dan Bradley <dan@xxxxxxxxxxxx>
Subject: Re: [Condor-users] offline compute nodes and Rooster



On 10/15/10 8:12 AM, Paul Haldane wrote:

Me again - I just need to check my understanding of how power management and Rooster should work.

This is 7.4.3 on a Linux central collector and 7.4.2 on Windows 7 compute nodes.

Behaviour I'm seeing is that compute nodes aren't being powered up to service queued jobs. I submitted a batch yesterday evening (after all the machines had gone to sleep). Nothing in RoosterLog to indicate that the system requested WoL of any workers. condor_status didn't show any nodes (though entries were written to offline.log for the hibernating machines). Doing some more experimentation shows that compute nodes appear in condor_status for a while (10-20 minutes) after the machines hibernate.

Can I just check my understanding of what should be happening ...

1. Condor on the compute nodes sends ADs to collector when Offline becomes true (idle and not claimed for at least a minute).  This information is stored in offline.log.  This bit is working as I expect.

Are you sure this part is working? I'm worried that if Condor tries and fails to hibernate the machine, it may send an ad that is not an "Offline" ad, and this will remove the Offline ad from the persistent store.

When ads are removed from the persistent store, you should see a line in offline.log beginning with 102, which is the DestroyClassAd command.
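One way to check for those removals is to scan offline.log for entries whose first field is 102. The following is a minimal sketch, not an official tool: the function name is hypothetical, and it assumes only that each log entry is one line and that whatever follows the 102 code identifies the ad that was removed.

```python
def find_destroyed_ads(log_lines):
    """Return the remainder of every log line whose first field is 102.

    102 is the DestroyClassAd command mentioned above. The layout of the
    rest of the line is an assumption; it is returned unparsed.
    """
    destroyed = []
    for line in log_lines:
        fields = line.split(None, 1)  # split off the leading command code
        if fields and fields[0] == "102":
            destroyed.append(fields[1].strip() if len(fields) > 1 else "")
    return destroyed


def find_destroyed_ads_in_file(path):
    """Convenience wrapper: scan a log file on disk."""
    with open(path, errors="replace") as handle:
        return find_destroyed_ads(handle)
```

Calling `find_destroyed_ads_in_file` on the collector's offline.log (the path depends on the local configuration) shortly after a node hibernates should show whether its Offline ad was destroyed again.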

2. If Condor were in control of power, Condor on the compute node would then put itself to sleep. We don't use that functionality; instead some other process does the hibernation (with logic to not hibernate if Condor is running a job).

3. Offline slots _should_ (I think they should, but would like confirmation) continue to appear in the output of condor_status (using -constraint Offline to just see offline slots). In our environment they only appear for 10-20 minutes after powering off. This isn't what I expect because OFFLINE_EXPIRE_ADS_AFTER defaults to maxint.

Yes, the offline ads should remain visible in condor_status. They should not expire in 30 minutes if you are using the default OFFLINE_EXPIRE_ADS_AFTER.

4. Hibernating compute nodes should be woken up by Rooster on the collector - it will only wake nodes which are visible in condor_status (again that's what I think - am I right?). This doesn't work for us because the offline nodes are only visible for a short time after the node hibernates.

Yes, you are right. If you are using the default unhibernate expression, what should happen is that some job will get matched to one of the offline ads and this will result in MachineLastMatchTime getting updated. Once that happens, the machine's Unhibernate expression should become true, which should cause Rooster to try to wake it up.
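The chain of events above can be modelled in a few lines. This is only an illustrative sketch: ads are plain Python dicts, `should_wake` is a hypothetical name, and it assumes the default Unhibernate expression amounts to "MachineLastMatchTime is defined". It does not evaluate real ClassAd expressions.

```python
def should_wake(ad):
    """Model of the wake-up decision for a single machine ad.

    Assumption: a machine is a wake-up candidate only if its ad is an
    Offline ad AND a job has been matched to it, which shows up as
    MachineLastMatchTime being present in the ad.
    """
    is_offline = ad.get("Offline") is True
    was_matched = ad.get("MachineLastMatchTime") is not None
    return is_offline and was_matched
```

Under this model, a node whose Offline ad has already vanished from the collector can never be matched, so it is never woken, which is consistent with the behaviour described in this thread.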

What I've observed here is that if Condor decides that it needs a node's resource in the short time between it becoming Offline and disappearing from condor_status then it will try to wake it (often it's already awake).

(a) is my mental model right?  If not please point me at the right docs (I might be just missing something obvious - just like the last problem I was having).

(b) Is the step that's missing in our environment the hibernation under Condor's control? Do the Condor daemons at that point send a message to the collector saying "please remember me while I'm asleep"?

Yes. Directly before hibernation, condor_startd sends an Offline ad to the collector, which is basically "please remember me while I'm asleep". Any ClassAd sent by the startd that is not an offline ad will remove the persistent Offline ad, on the assumption that the machine has now woken back up. This is unfortunate, because it doesn't support external hibernation very well.

It is possible to generate offline ads with condor_advertise (by setting Offline=True), so you could generate the offline ads that way, once you observe that the machine has gone away. This is what Ian Smith has done using an external script that he wrote.
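As a rough illustration of that approach (this is not Ian Smith's script, which is not shown here), an external helper might build a small ad and hand it to condor_advertise. The function names and the attribute set are assumptions; in practice the ad would probably need to be a copy of the machine's last full startd ad with Offline set to True, so that jobs can still match against it.

```python
import os
import subprocess
import tempfile


def build_offline_ad(machine, slot_name=None):
    """Build the text of a minimal offline machine ad.

    Only Offline = True comes from the advice above; the other attributes
    are an assumed bare minimum and are likely insufficient for matching.
    """
    name = slot_name or machine
    lines = [
        'MyType = "Machine"',
        f'Name = "{name}"',
        f'Machine = "{machine}"',
        "Offline = True",
    ]
    return "\n".join(lines) + "\n"


def advertise_offline(machine, slot_name=None):
    """Write the ad to a temporary file and send it with condor_advertise."""
    ad_text = build_offline_ad(machine, slot_name)
    with tempfile.NamedTemporaryFile("w", suffix=".ad", delete=False) as handle:
        handle.write(ad_text)
        ad_path = handle.name
    try:
        # UPDATE_STARTD_AD tells the collector to treat this as a machine ad.
        subprocess.run(["condor_advertise", "UPDATE_STARTD_AD", ad_path], check=True)
    finally:
        os.unlink(ad_path)
```

A monitoring script could call `advertise_offline` for each node once it stops responding, so the collector keeps an Offline ad for Rooster to act on.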


--Dan

Follow-Ups:
- Re: [Condor-users] offline compute nodes and Rooster
  - From: Paul Haldane

References:
- [Condor-users] offline compute nodes and Rooster
  - From: Paul Haldane
