
Re: [Condor-users] having multiple schedulers and collectors



Thanks for the responses.

What should the ratio be between a scheduler and its running jobs, and
between a scheduler and the jobs in its queue? Even during the week,
when we have 1100 jobs per scheduler, I start to see '????', which
can't be right. I spoke to our network gurus and they see nothing wrong
with the network, and the servers are functioning perfectly fine. I did
a bandwidth test and was able to push and pull 1TB in minutes on our
10G link.




On Tue, Jun 1, 2010 at 1:24 PM, Matthew Farrellee <matt@xxxxxxxxxx> wrote:
> There have been discussions about putting some statistical monitors into
> different daemons, including the Schedd, which would be direct indicators of
> backup etc. However, that's just talk at the moment.
>
> From a practical point of view, a few options are:
>  0) You can use the knowledge that the Negotiator puts a Slot into the
> Matched state, the Schedd then puts it into the Claimed/Idle state and then
> into Claimed/Busy, and observe that a large number of Matched slots can
> indicate backup. Claimed/Idle can also be an indicator.
>  1) You can look at the exit status of shadows in the Schedd's log and why
> the exits happened in the ShadowLog. Seeing exit code 4 or "FAILED TO SEND
> INITIAL KEEP ALIVE TO OUR PARENT" is an indication of a backed up Schedd.
>  2) You can look at response time to a condor_q query that should return no
> result and wouldn't initiate an O(n) scan of the queue.
>  3) Looking at the history file (with condor_history), you can compare the
> CompletionDate with the EnteredCurrentStatus and JobFinishedHookDone
> attributes. Any drift can indicate a backup in the Schedd.
>  4) Compare the reported count of running jobs in the Schedd (condor_q |
> tail -n1) to the condor_status -schedd. Some small drift is acceptable.
>  I'll just stop here. 8o)
>
> If the jobs aren't running, condor_q -run should not pick them up and show
> you ???s in the first place.
>
> Best,
>
>
> matt
>
> Mag Gam wrote:
>>
>> is there a way to see if the schedd is backed up? How can I see the
>> real status of it?
>>
>> It seems when I submit many jobs (even not running), I get this problem.
>>
>>
>> On Wed, May 26, 2010 at 2:16 PM, Matthew Farrellee <matt@xxxxxxxxxx>
>> wrote:
>>>
>>> On 05/25/2010 09:05 PM, Mag Gam wrote:
>>>>
>>>> Previously, I had 1 scheduler and 1 collector for 2000 nodes (each
>>>> with 16 cores) giving me 32000 slots. Everything was functioning fine,
>>>> however I used to get a lot of '???????' when I did condor_q -run.
>>>>
>>>> Recently, I added an extra scheduler and a collector to complement my
>>>> previous scheduler. I noticed the '?????' is completely gone! I was
>>>> wondering if there was a relation between this problem and having an
>>>> extra collector and scheduler in my pool.
>>>
>>> It's entirely possible that the "???"s were because your Schedd was
>>> backed up. It could have marked the jobs as running but info about the
>>> Startd where the job was running had not made its way back yet.
>>>
>>> Best,
>>>
>>>
>>> matt
>>>
>
>
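
Options 0 and 4 from the quoted reply could be scripted roughly as
follows. This is only a sketch and not something posted in the thread:
the function names are made up for illustration, the condor_status
-format invocation shown in the comment is one possible way to produce
the input, and the "N running" wording of the condor_q summary line is
an assumption that may differ between Condor versions. The functions
work on captured text, so they can be tried without a pool.

```python
import re
from collections import Counter


def count_slot_states(text):
    """Count (State, Activity) pairs, given one 'State Activity' pair per line.

    Input of this shape could come from something like:
      condor_status -format "%s " State -format "%s\n" Activity
    (assumed invocation; check it against your Condor version).
    """
    counts = Counter()
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            counts[(parts[0], parts[1])] += 1
    return counts


def backlog_fraction(counts):
    """Fraction of slots that are Matched or Claimed/Idle (option 0)."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    waiting = sum(
        n
        for (state, activity), n in counts.items()
        if state == "Matched" or (state == "Claimed" and activity == "Idle")
    )
    return waiting / total


def running_from_summary(summary_line):
    """Extract the running-job count from the last line of condor_q output.

    Assumes the line contains text like '1000 running'; returns None otherwise.
    """
    match = re.search(r"(\d+)\s+running", summary_line)
    return int(match.group(1)) if match else None


def running_drift(summary_line, schedd_reported_running):
    """Difference between the two running-job counts (option 4)."""
    queue_running = running_from_summary(summary_line)
    if queue_running is None:
        raise ValueError("no running-job count found in summary line")
    return queue_running - schedd_reported_running
```

A large backlog_fraction, or a drift that keeps growing between
samples, would point in the direction of a backed-up Schedd as
described above. What counts as "large" depends on the pool, so any
threshold would have to be chosen locally.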