Re: [Condor-users] Setup with dynamic slots runs only one job, ever



Version 7.4 does indeed remedy this problem, and dynamic slots are working as expected, which is terrific! I didn't see any mention of such a significant bug in any release notes, which is what gave me the comfort to try 7.2 in the first place. I tested 7.4 on a pool of FC12 machines.

Building 7.4 on FC8 was a rather involved process, since the rpm itself could not be extracted there due to some big/little-endian style word boundary issues. Generating the build commands on an FC12 machine and moving them to FC8, plus lots of hacking of the configuration (involving krb5, vm-gahp, libvirt, gsoap), got us there in the end. Our migration to FC12 is still a couple of months away, so I wanted to have a 7.4-on-FC8 alternative lined up.

Thank you.

Greg

On Feb 19, 2010, at 9:08 AM, Matthew Farrellee wrote:

> On 02/18/2010 04:49 PM, Greg Langmead wrote:
>> I have a small cluster with one submit node, and two slave machines.
>> One slave has 4 cores, the other has 8. All are Red Hat FC8, and
>> Condor is version 7.2.1. The config on the slaves looks like this:
>> 
>> SLOT_TYPE_1 = cpus=100%, ram=100%, swap=50%, disk=100%
>> NUM_SLOTS_TYPE_1 = 1
>> SLOT_TYPE_1_PARTITIONABLE = True
>> 
>> If I run "condor_restart -all", and then "condor_run hostname" then
>> it runs and I get a machine name back from hostname. A dynamic slot
>> was synthesized to run it, which I can observe by adding "sleep 60"
>> to the command, to give me time to look. Then the dynamic slot goes
>> away. After this happens, doing "condor_run hostname" remains idle
>> forever for "unknown reasons" (even with better-analyze).
>> 
>> I tried comparing condor_status -l on the two machines before and
>> after "condor_run hostname" had run, and one value that changed
>> afterwards is that VirtualMemory changed from 60000000 to -1 on the
>> machine that ran hostname. I thought that might be the problem, but
>> the second machine, which didn't run hostname, still has its full
>> VirtualMemory being reported, but it doesn't run the job either.
>> 
>> In another scenario, I created a submit file to run hostname, with
>> "queue 200" at the bottom. What happens when I submit it is that each
>> machine spawns one dynamic slot and each of those dynamic slots runs
>> one job until all 200 are finished. Even if I add "sleep 600" to the
>> job, so that a negotiator interval or two has to go by before the job
>> is done, no more than one slot is ever synthesized on either machine.
>> I feel that each machine should spawn up to its TotalCpus or Cpus,
>> which are both 4 on one machine and 8 on the other.
>> 
>> Any ideas how to debug this? Where is the decision to synthesize a
>> dynamic slot being logged? The NegotiatorLog simply says
>> 
>> 2/18 13:39:09     Request 02718.00000:
>> 2/18 13:39:09       Rejected 2718.0 glangmead@xxxxxxxxxxxxxxxxxx <192.168.129.20:48105>: no match found
>> 
>> Thanks,
>> Greg Langmead
>> Senior Research Scientist
>> Language Weaver, Inc.
> 
> I think this is just a bug in 7.2.1, fixed in later releases.
> 
> Can you try the 7.4 series? It might have been fixed during 7.2, but there's hopefully no reason for you to stay with 7.2. If you're on 7.2 because it is the only version available from the Fedora repos, I'm sorry: I can't build a newer Condor for anything before F10 at this point. F8 is a year past its EOL.
> 
> However, on F8 you should be able to grab the srpm from
> 
> http://koji.fedoraproject.org/koji/buildinfo?buildID=149532
> 
> and run rpmbuild --rebuild condor-7.4.1-1.fc12.src.rpm
> 
> Best,
> 
> 
> matt
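
For anyone reproducing the 200-job test described in the quoted message, a submit description file along the following lines could be used. This is only a sketch, not the file from the original post: the executable path, the output/error/log file names, and the explicit request_cpus line are assumptions added for illustration.

```
# Hypothetical submit file for the "queue 200" hostname test.
# The path and file names below are illustrative assumptions.
universe     = vanilla
executable   = /bin/hostname

# Assumed: request one core per job, so that a partitionable slot
# can carve out one dynamic slot for each running job.
request_cpus = 1

output       = hostname.$(Cluster).$(Process).out
error        = hostname.$(Cluster).$(Process).err
log          = hostname.log

queue 200
```

Submitting this with condor_submit and then watching condor_status while the jobs run should show how many dynamic slots each machine creates. On a working setup the expectation from the thread is up to 4 on one machine and 8 on the other.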