
Re: [HTCondor-users] Idle processes and TARGET.Memory



pete wrote:
> I am trying to troubleshoot a condor configuration
> on a multi core system. 16 Cores and 16 GBs of Ram.
> Currently all my jobs are sitting idle with the
> default configuration.


Pete,

This was one of the things that I had to get my head around when I was first deploying my pool back in May or June, coming from an antiquated Sun Grid Engine.

What's happening here is that Condor is dividing your 16-core machine into 16 equal-sized slots, each with an equal share of the available memory. Sixteen GB of RAM divided by 16 cores is 1024 MB per slot; subtract the memory the kernel reserves for itself, and that's how you arrive at 973.

Condor is essentially treating each processor core in the system as if it were a separate single-CPU computer system. That's just how we rolled, back in the 90's. (And we liked it!)

The job you're submitting, as you can see by the condor_q output, is requesting 1465 megabytes of memory. Since this exceeds the available memory for all of the "systems," each of which has one core and ~1GB of memory, Condor thinks your job can't run on any of them.

1   ( TARGET.Memory >= 1465 )         0                   MODIFY TO 973

A quick workaround is to put the following line in your submit description:

request_memory=973
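
For the record, a complete minimal submit description would look something like this (the executable name is just a placeholder - use your own):

    executable     = my_job
    request_memory = 973
    queue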

Or do a "condor_qedit &lt;job-id&gt; RequestMemory 973" to change your pending jobs - I think RequestMemory is the correct ClassAd attribute name. If you do a "condor_q -long" on one of the jobs and grep for 1465, you'll find the exact attribute you need to modify.
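
Something along these lines, assuming your job landed in cluster 42 - substitute the cluster.proc you actually see in condor_q:

    condor_q -long 42.0 | grep 1465
    condor_qedit 42.0 RequestMemory 973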

The longer-term solution is to configure dynamic slots. This allows jobs to claim as much of the total system memory as they need, and allows the scheduler to more efficiently manage SMP systems' resources. Under a dynamic slots scenario, you'd wind up with ten of those 1465MB jobs running on the 16-core machine (14650MB out of the system total of 15568MB (16*973)), with no swapping.
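
The basic partitionable-slot setup in condor_config looks something like this, as best I recall - double-check the knob names against the manual before deploying:

    SLOT_TYPE_1               = 100%
    SLOT_TYPE_1_PARTITIONABLE = TRUE
    NUM_SLOTS                 = 1
    NUM_SLOTS_TYPE_1          = 1

With that in place, each job's request_memory carves its share out of the one big slot instead of being wedged into a fixed 973MB slice.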

For most people, I think dynamic slots are really the best way to deal with multi-core machines in Condor pools.

Greg Thain wrote a "Dynamic Slot Tutorial" slide deck which you can find by Googling "condor dynamic slots," and there are details about it in the manual.

One thing to be sure to do when you enable dynamic slots is to set CLAIM_WORKLIFE to zero, or something rather small - mine is five minutes. Otherwise, if a request_memory=4096 job runs and is assigned a 4GB dynamic slot, another job which only needs 10MB can come along and claim that slot after the big job finishes, wasting the other ~4GB the scheduler had set aside for the slot. You do want it to be more than zero if your users are prone to submitting piles of 1-minute jobs against your best advice: letting the next 1-minute job claim the slot the previous one just released saves the negotiator a chunk of effort.
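
In the config file that's a single knob, measured in seconds - so my five minutes is:

    CLAIM_WORKLIFE = 300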

    -Michael Pelletier.