
Re: [Condor-users] rooster on linux, take 3



Some more info:

I am now testing with a HIBERNATE expression of just:
HIBERNATE = (NO_ONE_LOGGED_IN =?= True)

After a couple of hours with no one logged in, the test machine is still powered up.

I copied the setup described here:
<https://twiki.grid.iu.edu/bin/view/Tier3/CondorHawkeyeSetup>.

So I have:

STARTD_CRON_JOBLIST = NOONELOGGEDIN
STARTD_CRON_NOONELOGGEDIN_EXECUTABLE = /etc/condor/local/nooneloggedin.sh
STARTD_CRON_NOONELOGGEDIN_PERIOD = 30s
STARTD_CRON_NOONELOGGEDIN_MODE = Periodic

Is that correct for Condor v7.4.4?

nooneloggedin.sh has been tested independently (as described in my earlier
message, quoted below).
-Ian
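
For reference, a probe script of the kind configured above might look like the following. This is a minimal sketch under stated assumptions, not the actual nooneloggedin.sh (which is not shown in this thread): it assumes that who(1) listing no sessions means nobody is logged in, and the function names count_sessions and emit_attr are hypothetical. The attribute it prints has to be spelled exactly as it appears in the HIBERNATE expression (NO_ONE_LOGGED_IN here).

```shell
#!/bin/sh
# Hypothetical startd cron probe: prints one ClassAd attribute on stdout.
# Assumption: who(1) printing no lines means no user sessions exist.

# Count login sessions reported by who(1); tr strips padding from wc output.
count_sessions() {
    who 2>/dev/null | wc -l | tr -d '[:space:]'
}

# Print the attribute line for a given session count.
emit_attr() {
    if [ "$1" -eq 0 ]; then
        echo "NO_ONE_LOGGED_IN = True"
    else
        echo "NO_ONE_LOGGED_IN = False"
    fi
}

emit_attr "$(count_sessions)"
```

Run by hand, this prints a single line. Once the startd has picked it up, the attribute should also be visible in the machine ClassAd, for example via condor_status -long for that host.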


On 29/11/2011 09:00, "Ian Cottam" <Ian.Cottam@xxxxxxxxxxxxxxxx> wrote:

>Does anyone have a good working setup of "hibernation/rooster wake up"
>across a reasonably sized pool (i.e. bigger than just one test PC, although
>even that would be interesting)?
>Condor v7.4.4 or higher (earlier versions were known to have problems, I
>believe)?
>Please share your configs with us if you do.
>
>I recently added the recommended way of not hibernating if anyone is
>logged in (via a "startd cron", as it's sometimes called), and now the test
>machines don't hibernate at all. I have tested that the script generates
>NO_USER_LOGGED_IN = True
>when only Condor has the PC, and test for that in the HIBERNATE
>expression.
>
>Thanks.
>-Ian
>
>On 28/11/2011 18:44, "Dan Bradley" <dan@xxxxxxxxxxxx> wrote:
>
>>
>>
>>On 11/28/11 12:33 PM, Dimitri Maziuk wrote:
>>> On 11/28/2011 09:18 AM, Dan Bradley wrote:
>>>
>>> So the next question is how do I figure out what's up with the
>>>negotiator?
>>>
>>> (E.g.) with 40 cores busy and 4 cores sleeping condor_q -analyze 961082
>>> says:
>>>
>>> -- Submitter: minnow.bmrb.wisc.edu :
>>> <144.92.167.254:9617?sock=13250_c2fa_3>  : minnow.bmrb.wisc.edu
>>> ---
>>> 961082.000:  Run analysis summary.  Of 44 machines,
>>> ...
>>>        4 match but are currently offline
>>>        0 are available to run your job
>>>          No successful match recorded.
>>>          Last failed match: Fri Nov 25 18:18:55 2011
>>>          Reason for last match failure: no match found
>>> -----------------------------------------------------
>>>
>>> NegotiatorLog (on D_FULLDEBUG) is not very informative as to why the "4
>>> matching but offline" cores are not a "successful match":
>>>
>>> 11/25/11 18:17:55     Sending SEND_JOB_INFO/eom
>>> 11/25/11 18:17:55     Getting reply from schedd ...
>>> 11/25/11 18:17:55     Got JOB_INFO command; getting classad/eom
>>> 11/25/11 18:17:55     Request 961082.00000:
>>> 11/25/11 18:17:55 matchmakingAlgorithm: limit 4.000000 used 0.000000
>>> pieLeft 4.000000
>>> 11/25/11 18:17:55       Rejected 961082.0 bbee@xxxxxxxxxxxxx
>>> <144.92.167.254:9617?sock=13250_c2fa_3>: no match found
>>> --------------------------------------------------------
>>>
>>>
>>
>>
>>If you add D_JOB and D_MACHINE to NEGOTIATOR_DEBUG, you will get verbose
>>logging of every machine considered by the negotiator when trying to
>>match the job.  Is it even considering the offline machine?  If so, and
>>if it matches, I would expect the following to be logged by the
>>negotiator:
>>
>>"Registering attempt to match offline machine <host.name> by
>><user.name>."
>>
>>--Dan
>>
>>_______________________________________________
>>Condor-users mailing list
>>To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
>>subject: Unsubscribe
>>You can also unsubscribe by visiting
>>https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>>
>>The archives can be found at:
>>https://lists.cs.wisc.edu/archive/condor-users/
>>
>
>
>-- 
>Ian Cottam
>ext. 61851
>IT Services for Research
>Faculty of Engineering and Physical Sciences
>The University of Manchester
>"The only strategy that is guaranteed to fail is not taking risks." Mark
>Zuckerberg
>


-- 
Ian Cottam
ext. 61851
IT Services for Research
Faculty of Engineering and Physical Sciences
The University of Manchester
"The only strategy that is guaranteed to fail is not taking risks." Mark
Zuckerberg