
Re: [Condor-users] having multiple schedulers and collectors



There have been discussions about putting some statistical monitors into different daemons, including the Schedd, which would be direct indicators of backup etc. However, that's just talk at the moment.

From a practical point of view, a few options are:
0) You can use the knowledge that the Negotiator puts a Slot into the Matched state, and the Schedd then puts it into the Claimed/Idle state and then into Claimed/Busy. A large number of Matched slots can indicate backup. Claimed/Idle can also be an indicator.
1) You can look at the exit status of shadows in the Schedd's log and at why the exits happened in the ShadowLog. Seeing exit code 4 or "FAILED TO SEND INITIAL KEEP ALIVE TO OUR PARENT" is an indication of a backed up Schedd.
2) You can look at the response time to a condor_q query that should return no result and wouldn't initiate an O(n) scan of the queue.
3) Looking at the history file (with condor_history), you can compare the CompletionDate with the EnteredCurrentStatus and JobFinishedHookDone attributes. Any drift can indicate a backup in the Schedd.
4) Compare the reported count of running jobs in the Schedd (condor_q | tail -n1) to the count reported by condor_status -schedd. Some small drift is acceptable.
 I'll just stop here. 8o)
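
As a rough illustration of option 3, here is a minimal Python sketch. It assumes the history ads have already been parsed into dicts (the parsing step is not shown) and that the three attributes are epoch seconds. The function names and the 60-second threshold are hypothetical and only for illustration; pick whatever drift you consider acceptable for your pool.

```python
from typing import Dict, List


def completion_drift(ad: Dict[str, int]) -> int:
    """Seconds between CompletionDate and the latest Schedd-side timestamp.

    Assumes all values are epoch seconds. Returns 0 when neither
    EnteredCurrentStatus nor JobFinishedHookDone is present in the ad.
    """
    completed = ad["CompletionDate"]
    later = [ad[k] for k in ("EnteredCurrentStatus", "JobFinishedHookDone") if k in ad]
    return max(later) - completed if later else 0


def backed_up_jobs(ads: List[Dict[str, int]], threshold_s: int = 60) -> List[Dict[str, int]]:
    """Return the ads whose drift exceeds threshold_s (a hypothetical cutoff)."""
    return [ad for ad in ads if completion_drift(ad) > threshold_s]


# Made-up sample ads, purely for illustration.
sample_ads = [
    {"ClusterId": 1, "CompletionDate": 1274900000,
     "EnteredCurrentStatus": 1274900002, "JobFinishedHookDone": 1274900003},
    {"ClusterId": 2, "CompletionDate": 1274900000,
     "EnteredCurrentStatus": 1274900300, "JobFinishedHookDone": 1274900310},
]

for ad in backed_up_jobs(sample_ads):
    print(ad["ClusterId"], completion_drift(ad))
```

With the sample data only the second ad is flagged, since its Schedd-side timestamps trail the completion time by several minutes. A handful of such jobs is probably noise; a steady stream of them is the signal to look for.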

If the jobs aren't running, condor_q -run should not pick them up and show you ???s in the first place.

Best,


matt

Mag Gam wrote:
is there a way to see if the schedd is backed up? How can I see the
real status of it?

It seems when I submit many jobs (even not running), I get this problem.


On Wed, May 26, 2010 at 2:16 PM, Matthew Farrellee <matt@xxxxxxxxxx> wrote:
On 05/25/2010 09:05 PM, Mag Gam wrote:
Previously, I had 1 scheduler and 1 collector for 2000 nodes (each
with 16 core) giving me 32000 slots. Everything was functioning fine,
however I used to get a lot of '???????' when I did condor_q -run .

Recently, I added an extra scheduler and a collector to complement my
previous scheduler. I noticed the '?????' is completely gone! I was
wondering if there was a relation between this problem and having an
extra collector and scheduler in my pool.
It's entirely possible that the "???"s were because your Schedd was backed up. It could have marked the jobs as running but info about the Startd where the job was running had not made its way back yet.

Best,


matt