Re: [Condor-users] having multiple schedulers and collectors



Hi Mag,

We wrote up details on how to implement multiple schedds in different versions of Condor on our blog/wiki:
http://blog.cyclecomputing.com/

Good luck, and let me know if you have any questions!

Regards,
Rob

On 6/3/10 6:03 AM, Mag Gam wrote:
How does one put multiple schedulers on 1 server? That would be very useful.


On Wed, Jun 2, 2010 at 6:47 AM, Mag Gam <magawake@xxxxxxxxx> wrote:
Thanks for the responses.

What should the ratio be of scheduler to running jobs, and of scheduler
to jobs in the queue? Even during the week, when we have 1100 jobs per
scheduler, I start to see '????'; this can't be right. I spoke to our
Network Gurus and they see nothing wrong with the network, and the
servers are functioning perfectly fine. I did a bandwidth test and I was
able to push and pull 1TB in mins on our 10G link.




On Tue, Jun 1, 2010 at 1:24 PM, Matthew Farrellee <matt@xxxxxxxxxx> wrote:
There have been discussions about putting some statistical monitors into
different daemons, including the Schedd, which would be direct indicators of
backup etc. However, that's just talk at the moment.

From a practical point of view, a few options are:
  0) You can use the knowledge that the Negotiator puts a Slot into the
Matched state, and the Schedd then puts it into the Claimed/Idle state and
then into Claimed/Busy, and observe that a large number of Matched slots can
indicate backup. Claimed/Idle can also be an indicator.
  1) You can look at the exit status of shadows in the Schedd's log and why
the exits happened in the ShadowLog. Seeing exit code 4 or "FAILED TO SEND
INITIAL KEEP ALIVE TO OUR PARENT" is an indication of a backed up Schedd.
  2) You can look at the response time to a condor_q query that should return
no result and wouldn't initiate an O(n) scan of the queue.
  3) Looking at the history file (with condor_history), you can compare the
CompletionDate with the EnteredCurrentStatus and JobFinishedHookDone
attributes. Any drift can indicate a backup in the Schedd.
  4) Compare the reported count of running jobs in the Schedd (condor_q |
tail -n1) to the count reported by condor_status -schedd. Some small drift is
acceptable.
  I'll just stop here. 8o)
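
A minimal Python sketch of options 0, 2 and 4 above, under stated assumptions. The helper names (slot_state_counts, matched_fraction, running_drift, time_command) are made up for illustration and are not part of Condor. The inputs are assumed to have been collected separately, for example one slot State per line from condor_status, and the two running-job counts from condor_q and condor_status -schedd. What counts as "too many" Matched slots or "too much" drift is site-specific and is left to the reader.

```python
import subprocess
import time
from collections import Counter


def slot_state_counts(states):
    """Count slots per state from an iterable of state names (option 0).

    Assumes each item is a slot State string such as "Matched" or "Claimed";
    blank entries are ignored.
    """
    return Counter(s.strip() for s in states if s.strip())


def matched_fraction(counts):
    """Fraction of all slots that are in the Matched state (option 0)."""
    total = sum(counts.values())
    return counts.get("Matched", 0) / total if total else 0.0


def running_drift(schedd_running, collector_running):
    """Absolute difference between the two running-job counts (option 4)."""
    return abs(schedd_running - collector_running)


def time_command(argv):
    """Wall-clock seconds a command takes to finish (option 2).

    argv is whatever query you choose to time; no particular condor_q
    arguments are assumed here.
    """
    start = time.monotonic()
    subprocess.run(argv, stdout=subprocess.DEVNULL,
                   stderr=subprocess.DEVNULL, check=False)
    return time.monotonic() - start
```

Sampling these values periodically and watching for a rising Matched fraction, a growing drift, or a slower query is likely more informative than any single reading.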

If the jobs aren't running, condor_q -run should not pick them up and show
you ???s in the first place.

Best,


matt

Mag Gam wrote:
Is there a way to see if the schedd is backed up? How can I see the
real status of it?

It seems when I submit many jobs (even not running), I get this problem.


On Wed, May 26, 2010 at 2:16 PM, Matthew Farrellee <matt@xxxxxxxxxx> wrote:
On 05/25/2010 09:05 PM, Mag Gam wrote:
Previously, I had 1 scheduler and 1 collector for 2000 nodes (each
with 16 cores) giving me 32000 slots. Everything was functioning fine,
however I used to get a lot of '???????' when I did condor_q -run.

Recently, I added an extra scheduler and a collector to complement my
previous scheduler. I noticed the '?????' is completely gone! I was
wondering if there was a relation between this problem and having an
extra collector and scheduler in my pool.
It's entirely possible that the "???"s were because your Schedd was
backed up. It could have marked the jobs as running but info about the
Startd where the job was running had not made its way back yet.

Best,


matt


_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/
--

===================================
Rob Futrick
Senior Software Engineer

Cycle Computing, LLC
main: 888.292.5320

Leader in Condor Grid Solutions
Enterprise Condor Support and CycleServer Management Tools

http://www.cyclecomputing.com
http://www.cyclecloud.com