
Re: [Condor-users] having multiple schedulers and collectors



You should just try it out. Add a SCHEDD2 = $(SCHEDD), SCHEDD2_ARGS = -local-name SCHEDD2, DAEMON_LIST = $(DAEMON_LIST) SCHEDD2, DC_DAEMON_LIST... You'll want to make sure the places where the Schedd touches disk are also unique, e.g. SCHEDD.SCHEDD2.SPOOL = ...
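
For example, a minimal sketch of what that might look like in the config file
(the SCHEDD.SCHEDD2.* settings below are illustrative; adjust the paths and
names to your layout):

  # Run a second Schedd from the same binary
  SCHEDD2        = $(SCHEDD)
  SCHEDD2_ARGS   = -local-name SCHEDD2
  DAEMON_LIST    = $(DAEMON_LIST) SCHEDD2
  DC_DAEMON_LIST = + SCHEDD2

  # Give it its own name and its own on-disk state
  SCHEDD.SCHEDD2.SCHEDD_NAME         = schedd2
  SCHEDD.SCHEDD2.SPOOL               = $(SPOOL).schedd2
  SCHEDD.SCHEDD2.SCHEDD_LOG          = $(LOG)/SchedLog.SCHEDD2
  SCHEDD.SCHEDD2.SCHEDD_ADDRESS_FILE = $(LOG)/.schedd2_address
  SCHEDD.SCHEDD2.HISTORY             = $(SPOOL).schedd2/history

Create the second spool directory, restart the master, and you can then point
condor_submit -name / condor_q -name at the second Schedd.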

Best,


matt

On 06/03/2010 07:03 AM, Mag Gam wrote:
> How does one put multiple schedulers on one server? That would be very useful.
> 
> 
> On Wed, Jun 2, 2010 at 6:47 AM, Mag Gam <magawake@xxxxxxxxx> wrote:
>> Thanks for the responses.
>>
>> What should the ratio of schedulers to running jobs, and of schedulers to
>> jobs in the queue, be? Even during the week, when we have 1100 jobs per
>> scheduler, I start to see '????', which can't be right. I spoke to our
>> network gurus and they see nothing wrong with the network, and the
>> servers are functioning perfectly fine. I did a bandwidth test and was
>> able to push and pull 1 TB in minutes on our 10G link.
>>
>>
>>
>>
>> On Tue, Jun 1, 2010 at 1:24 PM, Matthew Farrellee <matt@xxxxxxxxxx> wrote:
>>> There have been discussions about putting some statistical monitors into
>>> different daemons, including the Schedd, which would be direct indicators of
>>> backup etc. However, that's just talk at the moment.
>>>
>>> From a practical point of view, a few options are:
>>>  0) You can use the knowledge that the Negotiator puts a Slot into the
>>> Matched state, the Schedd then puts it into the Claimed/Idle state and then
>>> into Claimed/Busy, and observe that a large number of Matched slots can
>>> indicate backup. Claimed/Idle can also be an indicator.
>>>  1) You can look at the exit status of shadows in the Schedd's log, and at
>>> why the exits happened in the ShadowLog. Seeing exit code 4 or "FAILED TO SEND
>>> INITIAL KEEP ALIVE TO OUR PARENT" is an indication of a backed-up Schedd.
>>>  2) You can look at the response time of a condor_q query that should return
>>> no result and wouldn't initiate an O(n) scan of the queue.
>>>  3) Looking at the history file (with condor_history), you can compare the
>>> CompletionDate with the EnteredCurrentStatus and JobFinishedHookDone
>>> attributes. Any drift can indicate a backup in the Schedd.
>>>  4) Compare the count of running jobs reported by the Schedd (condor_q |
>>> tail -n1) to the output of condor_status -schedd. Some small drift is
>>> acceptable.
>>>  I'll just stop here. 8o) (A few of these are sketched below.)
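>>>
>>> Assuming default attribute names (adjust as needed for your install),
>>> those checks might look roughly like:
>>>
>>>   # 0) slots sitting in Matched or Claimed/Idle
>>>   condor_status -total -constraint 'State == "Matched"'
>>>   condor_status -total -constraint 'State == "Claimed" && Activity == "Idle"'
>>>
>>>   # 2) time a query that should return nothing
>>>   time condor_q 999999999
>>>
>>>   # 3) per-job drift between completion and status/hook timestamps
>>>   condor_history -format "%d " CompletionDate \
>>>                  -format "%d " EnteredCurrentStatus \
>>>                  -format "%d\n" JobFinishedHookDone
>>>
>>>   # 4) running-job count as the Schedd sees it vs. what the Collector sees
>>>   condor_q | tail -n1
>>>   condor_status -schedd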
>>>
>>> If the jobs aren't running, condor_q -run should not pick them up and show
>>> you ???s in the first place.
>>>
>>> Best,
>>>
>>>
>>> matt
>>>
>>> Mag Gam wrote:
>>>>
>>>> Is there a way to see if the schedd is backed up? How can I see its
>>>> real status?
>>>>
>>>> It seems that when I submit many jobs (even ones that are not running),
>>>> I get this problem.
>>>>
>>>>
>>>> On Wed, May 26, 2010 at 2:16 PM, Matthew Farrellee <matt@xxxxxxxxxx>
>>>> wrote:
>>>>>
>>>>> On 05/25/2010 09:05 PM, Mag Gam wrote:
>>>>>>
>>>>>> Previously, I had 1 scheduler and 1 collector for 2000 nodes (each
>>>>>> with 16 cores), giving me 32000 slots. Everything was functioning fine;
>>>>>> however, I used to get a lot of '???????' when I did condor_q -run.
>>>>>>
>>>>>> Recently, I added an extra scheduler and a collector to complement my
>>>>>> previous scheduler. I noticed the '?????' is completely gone! I was
>>>>>> wondering if there was a relation between this problem and having an
>>>>>> extra collector and scheduler in my pool.
>>>>>
>>>>> It's entirely possible that the "???"s were because your Schedd was
>>>>> backed up. It could have marked the jobs as running but info about the
>>>>> Startd where the job was running had not made its way back yet.
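>>>>>
>>>>> One quick way to check (a sketch, assuming the host column of condor_q
>>>>> -run comes from RemoteHost): count jobs the Schedd says are running but
>>>>> that have no RemoteHost recorded yet, e.g.
>>>>>
>>>>>   condor_q -constraint 'JobStatus == 2 && RemoteHost =?= undefined' \
>>>>>            -format "%d." ClusterId -format "%d\n" ProcId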
>>>>>
>>>>> Best,
>>>>>
>>>>>
>>>>> matt
>>>>>
>>>
>>>
>>