Re: [Condor-users] 'parallel' universe job submission crashes SCHEDD
- Date: Wed, 29 Dec 2010 10:19:02 -0500
- From: Michael Hanke <michael.hanke@xxxxxxxxx>
- Subject: Re: [Condor-users] 'parallel' universe job submission crashes SCHEDD
On Tue, Dec 28, 2010 at 09:01:25PM -0500, Michael Hanke wrote:
> Looks like I am facing two problems: 1. The job is not successfully
> scheduled in the first place and 2. schedd crashed.
Problem 1 was my fault: I had another START expression in a config.d
file that failed when TARGET.RequestCpus was undefined. Problem 2
remains unsolved (but it no longer happens with the configuration
changes described below).
> So, although initially matched the job is gets rejected in the end. I
> cannot figure out what 'machine requirements' aren't satisfied.
> Submitting the same job as vanilla work like charm. I suspect it has
> something to do with the slot configuration (see below).
Looking further, that doesn't seem to be entirely true. If I remove
'SLOT_TYPE_1_PARTITIONABLE = TRUE' from the node configuration, my
parallel jobs get scheduled and run fine. And even with partitionable
slots they run, as long as the submit file doesn't specify
'RequestMemory' or 'RequestCpus'.
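For reference, a minimal sketch of the kind of node configuration I
mean (the macro names are the standard condor_config ones; the exact
values here are illustrative, not copied from my actual files):

```
# One partitionable slot spanning the whole machine, from which
# dynamic slots are carved out per job request.
NUM_SLOTS                 = 1
NUM_SLOTS_TYPE_1          = 1
SLOT_TYPE_1               = 100%
SLOT_TYPE_1_PARTITIONABLE = TRUE
```

Dropping the last line (or setting it to FALSE) is what makes the
parallel jobs schedule and run again.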
While 'RequestCpus' is probably not that useful for parallel jobs
anyway, I think 'RequestMemory' is essential for partitionable slots.
What happens with a 'RequestMemory' statement in the submit file is
that the dedicated scheduler claims a slot (a partition) but doesn't
honor the requested memory size. When Condor then tries to run the job
on the claimed slot partition, it fails because the slot has
insufficient memory.
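To make the failing case concrete, a submit file along these lines
triggers the behavior for me (executable and values are just an
example, not my real job; request_memory is in MB):

```
universe       = parallel
executable     = /bin/sleep
arguments      = 60
machine_count  = 2
request_memory = 4096
queue
```

The same file with the 'request_memory' line removed runs fine, as
does the equivalent vanilla-universe job with 'request_memory' kept.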
Now I am starting to wonder: maybe I'm trying to force a setup that
isn't optimal or even supported? From reading the manual I got the
impression that partitionable slots are the most appropriate for our
use cases. We have machines with 8 CPUs and 64GB of RAM each. We need
to run many single-CPU vanilla jobs, but also multi-threaded tools, as
well as MPI-like parallel jobs. The amount of memory needed varies
from little to huge. All these machines are dedicated cluster nodes.
Having a single partitionable slot per machine, from which resources
can be allocated according to a job's requirements, looked like the
most flexible option.
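For completeness, the node-side dedicated-scheduler settings I am
using follow the usual recipe from the manual for machines that should
serve both dedicated (parallel) and opportunistic (vanilla) jobs; the
submit host name below is a placeholder:

```
# Advertise the dedicated scheduler and prefer its jobs, while
# still accepting opportunistic vanilla jobs when idle.
DedicatedScheduler = "DedicatedScheduler@submit.example.org"
STARTD_ATTRS       = $(STARTD_ATTRS), DedicatedScheduler
RANK               = Scheduler =?= $(DedicatedScheduler)
```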
Can anybody provide some advice on how to ensure peaceful co-existence
of vanilla and parallel jobs on the same machines, given the mixed
workload described above?
Thanks in advance,