
Re: [Condor-users] rooster on linux, take 3

On 11/28/2011 09:18 AM, Dan Bradley wrote:

> No.  The rooster daemon is currently not configurable via
> condor_config_val.  You will need to modify the configuration file and
> run condor_reconfig.

That ought to be in the manual. The error message you get if you try to
use condor_config_val is not helpful, to put it mildly (WARNING: Potential
security problem, request refused).

> I'm skeptical about the truth of the statement in the manual.  In a
> quick glance through the code, I don't see any suppression of
> hibernation for an hour after it wakes up.  I could have overlooked it,
> but I've made a note to verify the behavior.

Well, what I see here is that a sleeping machine isn't getting matched by
the negotiator for some reason. If I wake it up manually it runs jobs for
5 minutes (HIBERNATE_CHECK_INTERVAL = 300) and then shuts down again. Its
sleep state is S4 as far as condor is concerned (it looks like a full
shutdown to me), so that 1 hour period should apply, and indeed it does
not seem to.

Which probably wouldn't be a problem if the negotiator kept the machine
busy, but that isn't happening. So far I have found only one way to match
that machine to a job (and have rooster wake it up): specifically
request it via TARGET.Machine in the job's submit file.
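
For concreteness, the workaround amounts to a submit description whose
requirements expression names the machine. A minimal Python sketch that
writes one out is below; the function name, the executable path and the
hostname are all made up for illustration, not taken from my setup.

```python
def pinned_submit_description(executable, machine):
    """Build a submit description that can only match one machine.

    Both arguments are placeholders supplied by the caller; nothing
    here talks to condor, it only produces the text of a submit file.
    """
    lines = [
        f"executable = {executable}",
        # Restrict matching to the slot(s) whose Machine attribute
        # equals the given hostname.
        f'requirements = (TARGET.Machine == "{machine}")',
        "queue",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(pinned_submit_description("/bin/hostname", "node01.example.org"), end="")
```

The output can be saved to a file and handed to condor_submit as usual.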

So the next question is how do I figure out what's up with the negotiator?

E.g., with 40 cores busy and 4 cores sleeping, condor_q -analyze 961082
says:

-- Submitter: minnow.bmrb.wisc.edu : <> : minnow.bmrb.wisc.edu
961082.000:  Run analysis summary.  Of 44 machines,
      4 match but are currently offline
      0 are available to run your job
        No successful match recorded.
        Last failed match: Fri Nov 25 18:18:55 2011
        Reason for last match failure: no match found
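
To keep an eye on those counts from a script, something like the
following Python sketch could pull them out of the summary. It is only a
sketch: the function name is made up, and it assumes the detail lines
keep the "<count> <description>" shape shown in the output above.

```python
import re

# A detail line is an indented count followed by a lowercase description,
# e.g. "      4 match but are currently offline".
DETAIL = re.compile(r"^[ \t]*(\d+)[ \t]+([a-z].*?)[ \t]*$", re.MULTILINE)


def parse_analysis_summary(text):
    """Return a dict of counts found in a 'Run analysis summary' block."""
    counts = {}
    total = re.search(r"Of (\d+) machines", text)
    if total:
        counts["total"] = int(total.group(1))
    for number, label in DETAIL.findall(text):
        counts[label] = int(number)
    return counts


if __name__ == "__main__":
    import sys

    # Usage: condor_q -analyze <job id> | python this_script.py
    for label, number in parse_analysis_summary(sys.stdin.read()).items():
        print(f"{number:6d}  {label}")
```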

NegotiatorLog (on D_FULLDEBUG) is not very informative as to why the "4
matching but offline" cores are not a "successful match":

11/25/11 18:17:55     Sending SEND_JOB_INFO/eom
11/25/11 18:17:55     Getting reply from schedd ...
11/25/11 18:17:55     Got JOB_INFO command; getting classad/eom
11/25/11 18:17:55     Request 961082.00000:
11/25/11 18:17:55 matchmakingAlgorithm: limit 4.000000 used 0.000000 pieLeft 4.000000
11/25/11 18:17:55       Rejected 961082.0 bbee@xxxxxxxxxxxxx <>: no match found
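
For sifting a longer NegotiatorLog, a small Python sketch along these
lines would list just the rejections. The names are made up, and it
assumes each log entry sits on a single line (as in the log file itself,
before any mail wrapping) with the "Rejected <job> <submitter> ...: <reason>"
layout seen above.

```python
import re

# Timestamp, the word "Rejected", the job id, then the rest of the line.
REJECTED = re.compile(
    r"^(?P<stamp>\d\d/\d\d/\d\d \d\d:\d\d:\d\d)\s+Rejected\s+"
    r"(?P<job>\d+\.\d+)\s+(?P<rest>.*)$"
)


def rejections(lines):
    """Yield (timestamp, job id, reason) for each 'Rejected' log line."""
    for line in lines:
        match = REJECTED.match(line.rstrip("\n"))
        if match is None:
            continue
        rest = match.group("rest")
        # The reason is whatever follows the last ": " on the line;
        # fall back to the whole remainder if there is no such separator.
        _, sep, reason = rest.rpartition(": ")
        yield match.group("stamp"), match.group("job"), reason if sep else rest


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py < NegotiatorLog
    for stamp, job, reason in rejections(sys.stdin):
        print(stamp, job, reason)
```

This only filters what the negotiator already logged, so it will not say
more about the offline slots than the log does.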

Dimitri Maziuk
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
