
Re: [Condor-users] Condor with AMQP

On Mon, Sep 20, 2010 at 1:07 PM, Berg, Allen <aberg@xxxxxxxx> wrote:

We have a relatively small Condor cluster: fifteen machines with a total of 140 CPUs.


We have implemented it using Apache Qpid. The Qpid daemon is installed on the master node; this package provides the queue “server”, the facility that provides message queuing to the cluster. The Apache Qpid API for C++ is installed on each cluster node.

Allen, I'll second what Matt said: more details are required. If you are running with job hooks, are you running on Windows? And what does condor_status say about the state of your machines when you think they should be running jobs?
When you say:

What I have questions about is this: when I submit, say, two very simple jobs (just a sleep command) for two of the nodes, the first job will take off and run, while the second job will sit there for possibly 20 minutes before it times out.

Many questions come to mind: how are you targeting machines with a Qpid-based queue? Are you writing job requirements into the ads of the jobs in the queue? How are jobs "timing out"? Do you mean they're falling out of the Qpid queue because some time-to-live value has expired, or do you mean they're running on a node far longer than they should?

Within any of the condor logs I am not seeing any errors or any indications of weirdness.

Advice is going to depend on your mechanism for fetching jobs. If it's Job Hooks, can you provide the Condor configuration you use for firing your hook script? Does your hook script write messages about what it's doing to stderr? Or does it just write out the matched Ad to stdout? Are you running any additional hook scripts to handle match rejections from the startd? To clean up after the job ends? Can you see your hook script being run by the startd if you look at the processes on the machine? What's the period you have set for firing the hook script when there's no match for the machine? What about when there's a match?
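To make the stderr/stdout question concrete, here is a rough sketch of a fetch-work hook that logs what it is doing to stderr and writes only the job ad to stdout. It assumes the usual Job Hooks contract as I understand it: the startd hands the slot ad to the hook on stdin and reads a job ad back from stdout, and empty output means "no work". The names `run_hook`, `format_job_ad` and `fetch_next_job` are hypothetical, the Qpid call is deliberately left as a callable you would supply yourself, and the job attributes in the usage note are placeholders rather than a complete ad.

```python
import sys
import time


def format_job_ad(attrs):
    """Render a dict as 'Name = value' lines, one attribute per line.

    Quoting is simplified: strings are wrapped in double quotes and embedded
    double quotes are backslash-escaped; numbers and booleans are left bare.
    """
    lines = []
    for name, value in attrs.items():
        if isinstance(value, bool):
            rendered = "True" if value else "False"
        elif isinstance(value, (int, float)):
            rendered = str(value)
        else:
            rendered = '"' + str(value).replace('"', '\\"') + '"'
        lines.append(f"{name} = {rendered}")
    return "\n".join(lines) + "\n"


def log(message, stream):
    """Write a timestamped diagnostic line; never goes to stdout."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    stream.write(f"{stamp} fetch_hook: {message}\n")
    stream.flush()


def run_hook(fetch_next_job, slot_ad_text, out=sys.stdout, err=sys.stderr):
    """Run one fetch attempt.

    fetch_next_job is a hypothetical callable that returns a dict of job
    attributes, or None when the queue has nothing for this slot.
    """
    # Pull the slot name out of the slot ad, purely for the log message.
    slot_name = "unknown"
    for line in slot_ad_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip().lower() == "name":
            slot_name = value.strip().strip('"')
    log(f"invoked for slot {slot_name}", err)

    job = fetch_next_job()
    if job is None:
        log("no work available; writing nothing to stdout", err)
        return 0

    out.write(format_job_ad(job))
    log(f"offered job with Cmd={job.get('Cmd')}", err)
    return 0
```

A wrapper script would call something like `sys.exit(run_hook(my_qpid_fetch, sys.stdin.read()))`, where `my_qpid_fetch` is whatever you already use to pull a message off the Qpid queue. With timestamps on stderr you can tell how often the startd is really invoking the hook and how often the hook comes back empty.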

Then if I run a larger test of, say, 40 jobs that each sleep for 5 seconds, I would expect that when I send the 40 jobs in they would all be picked up and run, completing in a reasonable amount of time. What I really see is that maybe 20 jobs take off, then 12 will start, then maybe 8, and the last few will complete. How can I find out how the queue is actually performing, and what can I do to better tune it?

It all depends on how you're pulling jobs off that queue.
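As a rough illustration of why the pull mechanism matters, here is a toy model of slots that each wait a fixed delay between fetch attempts. It is not a description of your cluster: `drain_time` is a hypothetical helper, the slot count and the 300-second delay are made-up example values, and the model assumes every slot makes its first fetch at time zero.

```python
import heapq


def drain_time(num_jobs, num_slots, job_seconds, fetch_delay):
    """Return (makespan, start_times) for a simple pull model.

    Every slot fetches at t=0. After running a job, a slot waits
    fetch_delay seconds before it fetches again.
    """
    # Min-heap of (time the slot next fetches, slot id).
    ready = [(0.0, slot) for slot in range(num_slots)]
    heapq.heapify(ready)

    start_times = []
    makespan = 0.0
    for _ in range(num_jobs):
        t, slot = heapq.heappop(ready)
        start_times.append(t)
        end = t + job_seconds
        makespan = max(makespan, end)
        heapq.heappush(ready, (end + fetch_delay, slot))
    return makespan, start_times


# 40 five-second jobs on 20 fetching slots, with and without a long delay.
slow, _ = drain_time(40, 20, 5, 300)
fast, _ = drain_time(40, 20, 5, 0)
print(slow, fast)
```

In this model the jobs start in batches separated by the fetch delay, so a handful of five-second jobs can take minutes to drain even though nothing is failing. If your real start times cluster the same way, the polling period is the first thing I would look at.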

- Ian

Cycle Computing, LLC
The Leader in Open Compute Solutions for Clouds, Servers, and Desktops
Enterprise Condor Support and Management Tools