
Re: [Condor-users] Condor with AMQP



Matt

I have resolved my "scheduling issue", but I have more questions if you don't mind.

What resolved my issue was adjusting these settings:

carod.conf: queued_connections = 30

condor_config.local: LL_Daemon_queued_connections = 30

We only have 13 servers in our cluster, with 108 total CPUs (nodes).  When I changed these two settings, things started flying.  It's fast, like "Talladega fast".

So my question is: what should we expect to change these settings to if we increase or decrease the size of our cluster?  Is there a recommended setting?

What made me choose 30, up from the 5 that they were set to, was looking at the output of qpid-stat -c.  I saw that there were 27 connections, so I figured, OK, let's try 30 and see what happens.  It was all good things from there!  I ran a 5-second sleep job for 5000 nodes and it flew through.  Previously, if I submitted, say, 100 sleep jobs of 5 seconds each, maybe half would take off and it would take a long time for the jobs to complete.  The only change has been to update the two queued connections settings in carod.conf and condor_config.local.
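
To make the sizing reasoning above concrete, here is a minimal sketch in Python.  It is only my rule of thumb, not anything documented for Condor, carod, or Qpid: take the connection count seen in qpid-stat -c and round it up to the next multiple of 5, so there is always some headroom.  The function name and the step of 5 are assumptions made up for illustration.

```python
def suggest_queued_connections(observed_connections: int, step: int = 5) -> int:
    """Hypothetical rule of thumb: round the observed connection count
    up to the next multiple of `step`, always leaving some headroom.

    `observed_connections` is the number read by hand from `qpid-stat -c`;
    this helper does not call or parse that tool.
    """
    if observed_connections < 0:
        raise ValueError("observed_connections must be non-negative")
    if step <= 0:
        raise ValueError("step must be positive")
    # Integer division plus one gives a value strictly above the observed count.
    return (observed_connections // step + 1) * step


if __name__ == "__main__":
    # 27 connections were observed on the 13-server cluster.
    print(suggest_queued_connections(27))
```

With the 27 connections I observed, this gives the 30 that I ended up using.  Whether the same rule holds for larger or smaller clusters is exactly what I am asking about.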

Thanks a lot!
Allen


-----Original Message-----
From: Matthew Farrellee [mailto:matt@xxxxxxxxxx] 
Sent: Monday, September 27, 2010 8:56 PM
To: Condor-Users Mail List
Cc: Berg, Allen
Subject: Re: [Condor-users] Condor with AMQP

Allen,

You can find documentation for Rob's work at (search for low latency)

	http://www.redhat.com/mrg/grid/

A description of it in a Condor Week 2008 presentation at (around slide 7)

	http://www.cs.wisc.edu/condor/CondorWeek2008/condor_presentations/trieloff_redhat.pdf

Or just the open code at

	http://git.fedorahosted.org/git/?p=grid/carod.git

Rob also follows this list.

Best,


matt

On 09/22/2010 04:48 PM, Berg, Allen wrote:
> Matt
>
> I am definitely interested. Where can I find more information on Rob Rati's implementation?
>
> We have a client that submits jobs from Windows to the Condor cluster.  The cluster is all Linux machines running RHEL 5.4.
> What I learned this morning is that Ganglia was actually the root cause of the problem.  I moved the master to a machine that did not have Ganglia and RRDtool installed, and everything seems to work fine.
>
> We are currently using a single generic queue for all the machines, without any ClassAds, and using job hooks.
>
> What I still find interesting is that if I submit, say, 150 sleep jobs with 56 nodes available, about 46 nodes will take off and run a batch. Then the number of running nodes drops to about half of the started nodes, and then drops again; node usage continually declines until all the jobs complete.
>
> It just seems to be odd behavior. I would expect all nodes to pick up work and keep working until all jobs were completed.
>
> Thanks
> Allen
>
>
> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Matthew Farrellee
> Sent: Tuesday, September 21, 2010 10:42 AM
> To: Condor-Users Mail List
> Subject: Re: [Condor-users] Condor with AMQP
>
> Allen,
>
> Are you using the Startd's work fetch (/job hooks) functionality to pull
> work from the messaging queues?
>
> If so, you should be looking at the execute nodes to see what the
> problem may be. It might be helpful to describe the mechanism you're
> using to deliver the work to Startds some more. Rob Rati implemented
> just such a system on top of Qpid, which you might be interested in.
>
> Best,
>
>
> matt
>
> On 09/20/2010 04:17 PM, Shahaan Ayyub wrote:
>>
>> Allen,
>> I have never worked with Qpid but it seems from having a quick look at
>> the documentation that it simply provides a high level interface, in
>> your case, to condor. I am amazed as to why native condor commands are
>> not working? Otherwise you might have to look for a wrapper around
>> native condor commands.
>> Sorry couldn't be of much help to you.
>> Regards,
>> Shahaan
>> On 21/09/2010, at 5:03 AM, Shahaan Ayyub <shahaan@xxxxxxxxx> wrote:
>>
>>> Hi Allen,
>>> What does condor_q -better-analyze say for different timestamps, i.e.
>>> when some of the jobs are held whilst some of them are still
>>> running/completed.
>>>
>>> Regards,
>>> Shahaan
>>>
>>> On 21/09/2010, at 3:07 AM, "Berg, Allen" <aberg@xxxxxxxx> wrote:
>>>
>>>> We have a relatively small Condor cluster: fifteen machines with a
>>>> total of 140 CPUs.
>>>>
>>>> We have implemented it using Apache Qpid. The Qpid daemon is installed
>>>> on the master node. This package provides the queue “server”. It is the
>>>> facility that provides message queuing to the cluster. The Apache
>>>> Qpid API for C++ is installed on each cluster node.
>>>>
>>>> What I am seeing, and have questions about, is this: when I submit,
>>>> say, two very simple jobs (just a sleep command) for two of the nodes,
>>>> the first job will take off and run, while the second job will sit
>>>> there for possibly 20 minutes before it times out. I am not seeing any
>>>> errors or any indications of weirdness in any of the Condor logs. Then,
>>>> if I run a larger test of, say, 40 jobs that sleep for 5 seconds, I
>>>> would expect all 40 jobs to be picked up and run, completing in a
>>>> reasonable amount of time. What I really see is maybe 20 jobs take
>>>> off, then 12 will start, then maybe 8, and the last few will complete.
>>>> How can I find out how the queue is actually performing, and what can
>>>> I do to better tune the queue?
>>>>
>>>> Thanks
>>>>
>>>> Allen
>>>>
>>>> This message and any enclosures are intended only for the addressee.  Please
>>>> notify the sender by email if you are not the intended recipient.  If you are
>>>> not the intended recipient, you may not use, copy, disclose, or distribute this
>>>> message or its contents or enclosures to any other person and any such actions
>>>> may be unlawful.  Ball reserves the right to monitor and review all messages
>>>> and enclosures sent to or from this email address.
>>>> _______________________________________________
>>>> Condor-users mailing list
>>>> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
>>>> subject: Unsubscribe
>>>> You can also unsubscribe by visiting
>>>> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>>>>
>>>> The archives can be found at:
>>>> https://lists.cs.wisc.edu/archive/condor-users/
>>
>>
>



