
Re: [Condor-users] Setup with dynamic slots runs only one job, ever



On 02/18/2010 04:49 PM, Greg Langmead wrote:
> I have a small cluster with one submit node, and two slave machines.
> One slave has 4 cores, the other has 8. All are red hat FC8, and
> condor is version 7.2.1. The config on the slaves looks like this:
> 
> SLOT_TYPE_1 = cpus=100%, ram=100%, swap=50%, disk=100%
> NUM_SLOTS_TYPE_1 = 1
> SLOT_TYPE_1_PARTITIONABLE = True
> 
> If I run "condor_restart -all", and then "condor_run hostname" then
> it runs and I get a machine name back from hostname. A dynamic slot
> was synthesized to run it, which I can observe by adding "sleep 60"
> to the command, to give me time to look. Then the dynamic slot goes
> away. After this happens, doing "condor_run hostname" remains idle
> forever for "unknown reasons" (even with better-analyze).
> 
> I tried comparing condor_status -l on the two machines before and
> after "condor_run hostname" had run, and one value that changed
> afterwards is that VirtualMemory changed from 60000000 to -1 on the
> machine that ran hostname. I thought that might be the problem, but
> the second machine, which didn't run hostname, still has its full
> VirtualMemory being reported, but it doesn't run the job either.
> 
> In another scenario, I created a submit file to run hostname, with
> "queue 200" at the bottom. What happens when I submit it is that each
> machine spawns one dynamic slot and each of those dynamic slots runs
> one job until all 200 are finished. Even if I add "sleep 600" to the
> job, so that a negotiator interval or two has to go by before the job
> is done, no more than one slot is ever synthesized on either machine.
> I feel that each machine should spawn up to its TotalCpus or Cpus,
> which are both 4 on one machine and 8 on the other.
> 
> Any ideas how to debug this? Where is the decision to synthesize a
> dynamic slot being logged? The NegotiatorLog simply says
> 
> 2/18 13:39:09     Request 02718.00000:
> 2/18 13:39:09       Rejected 2718.0 glangmead@xxxxxxxxxxxxxxxxxx <192.168.129.20:48105>: no match found
> 
> Thanks,
> Greg Langmead
> Senior Research Scientist
> Language Weaver, Inc.

I think this is just a bug in 7.2.1, fixed in later releases.

Can you try the 7.4 series? The bug may have been fixed later in the 7.2 series, but there's hopefully no reason for you to stay with 7.2. If you're on 7.2 because it is the only version available from the Fedora repos, I'm sorry: I can't build a newer Condor for anything before F10 at this point, and F8 is a year past its EOL.

However, on F8 you should be able to grab the srpm from

http://koji.fedoraproject.org/koji/buildinfo?buildID=149532

and run rpmbuild --rebuild condor-7.4.1-1.fc12.src.rpm
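
As a rough sketch of that rebuild step, assuming the src.rpm has already been downloaded from the Koji page into the current directory: the function name rebuild_condor is only an illustrative wrapper, not part of Condor or rpm. It checks that the file is present before handing it to rpmbuild, which will itself report any missing build dependencies.

```shell
# Hypothetical helper: rebuild a Condor source RPM for the local Fedora release.
# Assumes the src.rpm was fetched by hand from the Koji build page.
rebuild_condor() {
    srpm="${1:-condor-7.4.1-1.fc12.src.rpm}"
    if [ ! -f "$srpm" ]; then
        echo "missing $srpm; download it from the Koji build page first" >&2
        return 1
    fi
    # --rebuild unpacks the source package and builds binary RPMs from it
    rpmbuild --rebuild "$srpm"
}

# Usage (uncomment to run):
# rebuild_condor condor-7.4.1-1.fc12.src.rpm
```

The resulting binary RPMs should appear under the RPMS directory of your rpmbuild tree, from where they can be installed on both execute machines and the submit node.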

Best,


matt