
Re: [Condor-users] having multiple schedulers and collectors



I'll assume the jobs run for >15min, in which case the schedd shouldn't be backed up. Maybe something else is going on.

Best,


matt

On 06/02/2010 06:47 AM, Mag Gam wrote:
> Thanks for the responses.
> 
> What should the ratio of schedulers to running jobs, or schedulers to
> jobs in the queue, be? Even during the week, when we have 1100 jobs per
> scheduler, I start to see '????', which can't be right. I spoke to our
> Network Gurus and they see nothing wrong with the network, and the
> servers are functioning perfectly fine. I did a bandwidth test and was
> able to push and pull 1TB in minutes on our 10G link.
> 
> 
> 
> 
> On Tue, Jun 1, 2010 at 1:24 PM, Matthew Farrellee <matt@xxxxxxxxxx> wrote:
>> There have been discussions about putting some statistical monitors into
>> the different daemons, including the Schedd, which would be direct
>> indicators of backup and the like. However, that's just talk at the moment.
>>
>> From a practical point of view, a few options are:
>>  0) You can use the knowledge that the Negotiator puts a Slot into the
>> Matched state, and that the Schedd then puts it into the Claimed/Idle
>> state and then into Claimed/Busy, and observe that a large number of
>> Matched slots can indicate backup. Claimed/Idle can also be an indicator.
>>  1) You can look at the exit status of shadows in the Schedd's log, and
>> at why the exits happened in the ShadowLog. Seeing exit code 4 or "FAILED
>> TO SEND INITIAL KEEP ALIVE TO OUR PARENT" is an indication of a backed-up
>> Schedd.
>>  2) You can look at the response time of a condor_q query that should
>> return no result and wouldn't initiate an O(n) scan of the queue.
>>  3) Looking at the history file (with condor_history), you can compare the
>> CompletionDate with the EnteredCurrentStatus and JobFinishedHookDone
>> attributes. Any drift can indicate a backup in the Schedd.
>>  4) Compare the count of running jobs reported by the Schedd (condor_q |
>> tail -n1) to what condor_status -schedd reports. Some small drift is
>> acceptable.
>>  I'll just stop here. 8o) (A rough sketch pulling a few of these checks
>> together follows below.)
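>>
>> Here's that sketch as shell commands (untested; the log location and the
>> sample job id are assumptions for a stock install; condor_config_val LOG
>> will tell you where your logs actually live):
>>
>>   #!/bin/sh
>>   # Assumed log directory; verify with: condor_config_val LOG
>>   LOG_DIR=/var/log/condor
>>
>>   # 0) Slots stuck in Matched, or parked in Claimed/Idle.
>>   condor_status -constraint 'State == "Matched"' -total
>>   condor_status -constraint 'State == "Claimed" && Activity == "Idle"' -total
>>
>>   # 1) Shadows failing to keep-alive, a symptom of a backed-up Schedd.
>>   grep -c 'FAILED TO SEND INITIAL KEEP ALIVE TO OUR PARENT' "$LOG_DIR/ShadowLog"
>>
>>   # 2) Response time of a lookup for a (presumably nonexistent) job id;
>>   #    a healthy Schedd answers almost immediately.
>>   time condor_q 999999999.0
>>
>>   # 3) Drift between completion-related timestamps in the history file.
>>   #    Large gaps suggest the Schedd fell behind on finishing jobs.
>>   condor_history -format '%d ' CompletionDate \
>>                  -format '%d ' EnteredCurrentStatus \
>>                  -format '%d\n' JobFinishedHookDone | tail -n 20
>>
>>   # 4) Running-job count as the Schedd's queue reports it, vs. what the
>>   #    Schedd advertises to the Collector.
>>   condor_q | tail -n 1
>>   condor_status -schedd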
>>
>> If the jobs aren't running, condor_q -run should not pick them up and
>> show you ???s in the first place.
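>>
>> And if you want to count the ???s directly, an untested one-liner
>> (assuming the ???s are jobs marked running whose RemoteHost attribute
>> hasn't been filled in yet):
>>
>>   condor_q -constraint 'JobStatus == 2 && RemoteHost =?= UNDEFINED' \
>>            -format '%d.' ClusterId -format '%d\n' ProcId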
>>
>> Best,
>>
>>
>> matt
>>
>> Mag Gam wrote:
>>>
>>> Is there a way to see if the schedd is backed up? How can I see its
>>> real status?
>>>
>>> It seems that when I submit many jobs (even ones that aren't running),
>>> I get this problem.
>>>
>>>
>>> On Wed, May 26, 2010 at 2:16 PM, Matthew Farrellee <matt@xxxxxxxxxx>
>>> wrote:
>>>>
>>>> On 05/25/2010 09:05 PM, Mag Gam wrote:
>>>>>
>>>>> Previously, I had 1 scheduler and 1 collector for 2000 nodes (each
>>>>> with 16 cores), giving me 32000 slots. Everything was functioning
>>>>> fine; however, I used to get a lot of '???????' when I did condor_q -run.
>>>>>
>>>>> Recently, I added an extra scheduler and a collector to complement my
>>>>> previous scheduler. I noticed the '?????' is completely gone! I was
>>>>> wondering if there was a relation between this problem and having an
>>>>> extra collector and scheduler in my pool.
>>>>
>>>> It's entirely possible that the "???"s were because your Schedd was
>>>> backed up: it could have marked the jobs as running while the info
>>>> about the Startds where they were running had not yet made its way back.
>>>>
>>>> Best,
>>>>
>>>>
>>>> matt
>>>>
>>
>>