[Condor-users] offline compute nodes and Rooster
- Date: Fri, 15 Oct 2010 14:12:27 +0100
- From: Paul Haldane <paul.haldane@xxxxxxxxxxxxxxx>
- Subject: [Condor-users] offline compute nodes and Rooster
Me again - I just need to check my understanding of how power management and Rooster should work.
We're running Condor 7.4.3 on a Linux central collector and 7.4.2 on Windows 7 compute nodes.
The behaviour I'm seeing is that compute nodes aren't being powered up to service queued jobs. I submitted a batch yesterday evening (after all the machines had gone to sleep). Nothing in RoosterLog indicates that the system requested WoL for any workers. condor_status didn't show any nodes (though entries were written to offline.log for the hibernating machines). Some more experimentation shows that compute nodes appear in condor_status for a while (10-20 minutes) after the machines hibernate.
Can I just check my understanding of what should be happening ...
1. Condor on the compute nodes sends ads to the collector when Offline becomes true (idle and not claimed for at least a minute). This information is stored in offline.log. This bit is working as I expect.
2. If Condor were in control of power, Condor on the compute node would then put the machine to sleep. We don't use that functionality; instead some other process does the hibernation (with logic to not hibernate if Condor is running a job).
3. Offline slots _should_ (I think they should, but would like confirmation) continue to appear in the output of condor_status (using -constraint Offline to see just the offline slots). In our environment they only appear for 10-20 minutes after powering off. This isn't what I expect, because OFFLINE_EXPIRE_ADS_AFTER defaults to maxint.
4. Hibernating compute nodes should be woken up by Rooster on the collector. It will only wake nodes which are visible in condor_status (again that's what I think - am I right?). This doesn't work for us because the offline nodes are only visible for a short time after the node hibernates.
What I've observed here is that if Condor decides that it needs a node's resource in the short time between it becoming Offline and disappearing from condor_status then it will try to wake it (often it's already awake).
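To make the model in points 1 to 4 concrete, here is a toy Python sketch of what I think the collector and Rooster are doing. None of this is Condor code or a Condor API: `ToyCollector`, `advertise`, `visible` and `rooster_pass` are names made up for illustration, and the 15-minute lifetime for ordinary ads is an assumption on my part, not a value taken from our configuration.

```python
import sys
from dataclasses import dataclass


@dataclass
class Ad:
    name: str           # machine name (hypothetical)
    offline: bool       # True if the ad was stored as an offline ad
    last_update: float  # time of the last update, in seconds


class ToyCollector:
    """Toy stand-in for the collector; not the real daemon."""

    def __init__(self, ad_lifetime=900.0,
                 offline_expire_after=float(sys.maxsize)):
        # ad_lifetime: assumed lifetime of an ordinary ad (15 minutes).
        # offline_expire_after: mirrors the "defaults to maxint" model.
        self.ad_lifetime = ad_lifetime
        self.offline_expire_after = offline_expire_after
        self.ads = {}

    def advertise(self, name, offline, now):
        # A newer ad for the same machine replaces the older one.
        self.ads[name] = Ad(name, offline, now)

    def visible(self, now):
        # Ads that a status query would still show at time `now`.
        shown = []
        for ad in self.ads.values():
            limit = self.offline_expire_after if ad.offline else self.ad_lifetime
            if now - ad.last_update <= limit:
                shown.append(ad)
        return shown


def rooster_pass(collector, idle_jobs, now):
    # Only offline ads that are still visible can be chosen for a wake-up,
    # and never more machines than there are idle jobs.
    candidates = sorted(ad.name for ad in collector.visible(now) if ad.offline)
    return candidates[:idle_jobs]


if __name__ == "__main__":
    c = ToyCollector()
    c.advertise("node1", offline=True, now=0.0)   # stored as an offline ad
    c.advertise("node2", offline=False, now=0.0)  # ordinary ad, then hibernated externally
    print([ad.name for ad in c.visible(3600.0)])
    print(rooster_pass(c, idle_jobs=5, now=3600.0))
```

In this sketch a machine whose last ad was an ordinary one drops out of view once the assumed lifetime passes, so Rooster never considers it. That looks like what we are seeing, but I may well have the model wrong.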
(a) Is my mental model right? If not, please point me at the right docs (I might just be missing something obvious, like the last problem I was having).
(b) Is the step that's missing in our environment the hibernation under Condor's control? Do the Condor daemons at that point send a message to the collector saying "please remember me while I'm asleep"?
Information Systems and Services