On Wed, Sep 22, 2010 at 4:56 PM, Berg, Allen <aberg@xxxxxxxx> wrote:
I am definitely interested. Where can I find more information on Rob Rati's implementation?
Have you seen this?
We have a client that submits jobs from Windows to the Condor cluster. The cluster is all Linux machines running RHEL 5.4.
What I learned this morning is that Ganglia was actually the root cause of the problem. I moved the master to a machine that did not have Ganglia and RRDtool installed, and everything seems to work fine.
That's odd. But if "fine" is what you're describing below, that doesn't seem fine to me. :)
What I still find interesting is this: if I submit, say, 150 sleep jobs with 56 nodes available, about 46 nodes take off and run a batch. The number of running nodes then drops to about half of the nodes that started, then drops again, and usage keeps declining until all the jobs complete.
It just seems to be odd behavior. I would expect all nodes to pick up work and keep working until all jobs were completed.
So the problem might be in one of two places:
It might lie in your hook script or it might lie in how the hook script is being called by Condor.
For a machine that was once running a job, but is now no longer running a job, what does the StartLog look like? I'd be interested to see the log output from when it picked up the job via the hook script, and then the log output from a later date. First, you want to confirm that the hook script runs again after a job has run on the machine. Then you want to see whether the script is producing any output, whether it's even attempting to run a job on the machine, and with what parameters.
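
If it helps, here is a rough Python sketch for pulling the relevant lines out of a StartLog so the two time periods can be compared side by side. It assumes the entries of interest contain the word "hook" somewhere in the line, which is a guess on my part and may not match what your startd actually writes, so adjust the keyword as needed. The function names and the keyword default are made up for illustration; the log path is whatever you pass on the command line.

```python
import sys
from typing import Iterable, List


def filter_log_lines(lines: Iterable[str], keyword: str = "hook") -> List[str]:
    """Return the lines that contain `keyword`, ignoring case.

    The default keyword "hook" is an assumption about what the
    relevant log entries contain; change it to match your log.
    """
    needle = keyword.lower()
    return [line.rstrip("\n") for line in lines if needle in line.lower()]


def scan_log(path: str, keyword: str = "hook") -> List[str]:
    """Read the log file at `path` and return the matching lines."""
    # errors="replace" keeps the scan going if the log has odd bytes in it.
    with open(path, "r", errors="replace") as handle:
        return filter_log_lines(handle, keyword)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python scan_startlog.py /path/to/StartLog [keyword]
    chosen_keyword = sys.argv[2] if len(sys.argv) > 2 else "hook"
    for matched in scan_log(sys.argv[1], chosen_keyword):
        print(matched)
```

Run it once against the StartLog of a machine that is still picking up jobs and once against one that has gone idle. If the idle machine shows no matching lines after its last job finished, that points at how the hook is being called rather than at the hook script itself.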