
Re: [Condor-users] rooster on linux, take 2

On 11/21/11 6:05 PM, Dimitri Maziuk wrote:
I guess my original question remains unanswered: how do you tell rooster
to wake up a node?

I am getting

11/21/11 16:49:59 WARNING: Someone at is trying to modify
11/21/11 16:49:59 WARNING: Potential security problem, request refused

from condor_config_val despite the

ALLOW_WRITE = *.bmrb.wisc.edu

Changing configuration remotely requires additional configuration settings that explicitly allow it; ALLOW_WRITE alone is not enough. Example:
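
The example itself is missing from this copy of the message. A minimal sketch of the kind of settings involved, assuming the documented macros ENABLE_RUNTIME_CONFIG, ALLOW_CONFIG and SETTABLE_ATTRS_CONFIG, might look like the following. The host pattern is copied from the ALLOW_WRITE line above; the attributes listed as settable are an assumption chosen for illustration, not taken from the thread.

```
# On the machine whose daemons should accept remote changes.

# Allow runtime (in-memory) changes via condor_config_val -rset.
ENABLE_RUNTIME_CONFIG = True

# Hosts permitted to make CONFIG-level changes
# (pattern reused from ALLOW_WRITE above).
ALLOW_CONFIG = *.bmrb.wisc.edu

# Settings that may be changed at CONFIG level.
# Assumption: these two are placeholders; list whatever you need to change.
SETTABLE_ATTRS_CONFIG = HIBERNATE, HIBERNATE_CHECK_INTERVAL
```

With settings along these lines, a runtime change would be sent with condor_config_val -rset rather than -set, since -set asks for a persistent change, which is governed separately by ENABLE_PERSISTENT_CONFIG.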


The above example allows changing configuration settings in a running daemon's memory (i.e. not saved permanently to disk). I believe you have to run condor_reconfig after making the change to make it take effect.

Presumably, you are doing this before hibernating the machine? Because after the machine goes into hibernation, you obviously can't modify its configuration.

Submitting 10K jobs is not a usable debug technique. Besides, it worked
once, but now RoosterLog again claims

"Got 0 startd ads matching ROOSTER_UNHIBERNATE=Offline&&  Unhibernate"

in spite of the full queue

Done   Pre   Queued   Post   Ready   Un-Ready   Failed
 ===   ===      ===    ===     ===        ===      ===
 230     0       41      4    8187          0        0


954838.000:  Run analysis summary.  Of 44 machines,
       4 match but are currently offline

The default Unhibernate expression should be true if MachineLastMatchTime is set. MachineLastMatchTime should get set when the negotiator matches a job to an offline machine.

To see if the negotiator is matching jobs to offline machines, add D_FULLDEBUG to NEGOTIATOR_DEBUG. You should then see the following message in NegotiatorLog:

"Registering attempt to match offline machine <host.name> by <user.name>."
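
As a sketch, the debug change described above could be made in the central manager's local configuration file like this. Where that file lives depends on your installation, and any debug flags you already set can be kept on the same line.

```
# Assumption: placed in the central manager's local config file.
# Turns on verbose negotiator logging so offline-match attempts are recorded.
NEGOTIATOR_DEBUG = D_FULLDEBUG
```

After a condor_reconfig on the central manager, the message should appear in the file that condor_config_val NEGOTIATOR_LOG reports, once the negotiator has run a cycle with idle jobs that match the offline machines.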

This should result in MachineLastMatchTime getting set in the offline machine ad. You should be able to look at the offline machine ad with a command such as this:

condor_status -l <host.name>
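
The full ad is long, so it may help to filter it down to the attributes discussed here. This is a small sketch: the function name show_wake_attrs and the host name in the usage comment are made up for illustration, and it assumes the usual "Attribute = value" layout of condor_status -l output.

```shell
# Keep only the ad attributes that matter for waking an offline machine.
# Reads a long-format machine ad on stdin.
show_wake_attrs() {
    grep -i -E '^(Name|Offline|Unhibernate|MachineLastMatchTime) *='
}

# Usage (node01.example.com is a placeholder host name):
#   condor_status -l node01.example.com | show_wake_attrs
```

If MachineLastMatchTime is absent from the filtered output, the negotiator has not recorded a match against that offline ad, which would explain rooster finding nothing to wake.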

All rooster does is periodically query the collector to find machines for which Unhibernate is true. It then uses condor_power to wake them up.
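
The settings that drive this loop can be sketched as follows. ROOSTER_UNHIBERNATE is the expression quoted from RoosterLog above; the other values are what I believe the documented defaults to be and should be checked against the manual for your version.

```
# Which offline machine ads rooster selects from the collector
# (the expression shown in the RoosterLog message above).
ROOSTER_UNHIBERNATE = Offline && Unhibernate

# When an offline machine counts as needing a wake-up.
# Assumption: this is the default expression.
UNHIBERNATE = MachineLastMatchTime =!= UNDEFINED

# Seconds between rooster's collector queries (assumed default).
ROOSTER_INTERVAL = 300

# Command rooster runs to wake a machine; it receives the
# machine ad on standard input (assumed default).
ROOSTER_WAKEUP_CMD = "$(BIN)/condor_power -d -i"
```

Running condor_config_val on each of these names on the machine that runs rooster shows the values actually in effect.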

I hope that helps.


p.s. The problem with condor_power has been fixed for 7.6.5. We can give you a pre-release if you want one.