
Re: [Condor-users] 'parallel' universe job submission crashes SCHEDD

Hi Michael:

We have run into similar problems on this end - partitionable slots seem optimal for our general use cases: lots of vanilla jobs and a fair number of parallel jobs, each using a number of processors and some amount of memory.

There were a couple of suggestions for running these two models simultaneously in Condor in a thread I started back in August (some require quite a bit more tinkering than the others). See https://lists.cs.wisc.edu/archive/condor-users/2010-August/msg00229.shtml

Right now, I am waiting in joyful anticipation for the closing of ticket #986 - see the contents here https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=986,0

In our case, 'RequestCpus' is the important aspect of parallel jobs - users want to be able to specify RequestCpus=8, num_machines=2 to receive 8 processors per node on two nodes.
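
As a rough sketch of what such a submit description file might look like (the executable name and output file names are placeholders, and I am writing the node count with the parallel universe's machine_count command rather than 'num_machines'):

```
# Hypothetical parallel universe job: 2 nodes, 8 CPUs on each.
# my_mpi_wrapper.sh is a placeholder, not a real script from this thread.
universe      = parallel
executable    = my_mpi_wrapper.sh
machine_count = 2
request_cpus  = 8
output        = job.$(NODE).out
error         = job.$(NODE).err
log           = job.log
queue
```

Whether the dedicated scheduler actually honors request_cpus against a partitionable slot is exactly what the ticket above is about, so treat this as the desired usage, not as something known to work today.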

I assume from your problem statement that the memory required per process for either the parallel or vanilla jobs is larger than the default memory value of 8GB assigned per slot in the non-partitionable configuration (64GB total/8 processors per machine). Is this correct?


On 12/29/2010 09:19 AM, Michael Hanke wrote:

On Tue, Dec 28, 2010 at 09:01:25PM -0500, Michael Hanke wrote:
Looks like I am facing two problems: 1. The job is not successfully
scheduled in the first place and 2. schedd crashed.

Problem 1. was my fault (I had another START expression in a config.d
file that failed when TARGET.RequestCpus was undefined). Problem 2.
remains unsolved (but doesn't happen anymore with the changes
described below).

So, although initially matched, the job gets rejected in the end.  I
cannot figure out which 'machine requirements' aren't satisfied.
Submitting the same job as vanilla works like a charm. I suspect it has
something to do with the slot configuration (see below).

Looking further, that doesn't seem to be entirely true. Indeed, if I
remove 'SLOT_TYPE_1_PARTITIONABLE = TRUE' from the node configuration,
my parallel jobs get scheduled and run fine. However, even with
partitionable slots they run IF the submit file doesn't specify
'RequestMemory' or 'RequestCpus'.

While 'RequestCpus' is probably not that useful for parallel jobs anyway,
I think 'RequestMemory' is essential for partitionable slots. What
happens with a 'RequestMemory' statement in the submit file is that the
dedicated scheduler claims a slot (a partition) but doesn't honor the
requested memory size. When condor tries to run the job on the claimed
slot partition it fails, because the slot has insufficient memory.

Now I start to wonder: Maybe I'm trying to force a setup that is
suboptimal, unsupported, or problematic? From reading the manual I got the
impression that partitionable slots are the most appropriate for our use
cases. We have machines with 8 CPUs and 64GB each. We need to run many
single CPU vanilla jobs, but also multi-threaded tools, as well as
MPI-like parallel jobs. The necessary amount of memory varies from
little to HUGE. All these machines are dedicated cluster nodes. Having a
single partitionable slot where resources can be allocated according to
a job's requirements looked like the most flexible.
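
For reference, a minimal sketch of the kind of node configuration meant here, assuming one slot type that covers the whole machine (the percentages are an assumption for illustration; only the SLOT_TYPE_1_PARTITIONABLE line is quoted from the actual setup):

```
# One partitionable slot holding all CPUs and memory of the node
# (8 CPUs and 64GB on the machines described above).
NUM_SLOTS                 = 1
NUM_SLOTS_TYPE_1          = 1
SLOT_TYPE_1               = cpus=100%, memory=100%, disk=100%, swap=100%
SLOT_TYPE_1_PARTITIONABLE = TRUE
```

Without the last line the machine would instead be split into fixed slots, which by default means 8 slots of 1 CPU and 8GB each on this hardware.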

Can anybody provide some advice on how to ensure peaceful co-existence
of vanilla and parallel jobs on the same machines -- given that they have
enough resources?

Thanks in advance,