Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] 'parallel' universe job submission crashes SCHEDD

Date: Wed, 29 Dec 2010 10:22:42 -0600
From: "David J. Herzfeld" <david.herzfeld@xxxxxxxxxxxxx>
Subject: Re: [Condor-users] 'parallel' universe job submission crashes SCHEDD

Hi Michael:

We have run into similar problems on this end - partitionable slots seem optimal for our general use cases: lots of vanilla jobs and a fair number of parallel jobs, each using a number of processors/memory.

There were a couple of suggestions for running these two models simultaneously in Condor in a thread I started back in August (some require quite a bit more tinkering than the others). See https://lists.cs.wisc.edu/archive/condor-users/2010-August/msg00229.shtml

Right now, I am waiting in joyful anticipation for the closing of ticket #986 - see the contents here: https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=986,0

In our case, 'RequestCpus' is the important aspect of parallel jobs - users want to be able to specify RequestCpus=8, num_machines=2 to receive 8 processors per node on two nodes.
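
For concreteness, a submit description along the following lines is roughly what such a request would look like. This is only a sketch under assumptions: the executable and file names are placeholders, and the node count is written with the parallel universe submit command machine_count (referred to informally as num_machines above). With partitionable slots it may not match as intended until the ticket mentioned above is resolved.

```text
# Hypothetical parallel universe job: 2 nodes, 8 cores requested on each.
# my_mpi_wrapper.sh and the log/output names are placeholders.
universe      = parallel
executable    = my_mpi_wrapper.sh
machine_count = 2
request_cpus  = 8
log           = job.log
output        = out.$(NODE)
error         = err.$(NODE)
queue
```

Comments are kept on their own lines because a trailing comment on a submit command would be read as part of the value.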

I assume from your problem statement that the memory required per process for either the parallel or vanilla jobs is larger than the default memory value of 8GB assigned per slot in the non-partitionable configuration (64GB total/8 processors per machine). Is this correct?


DJH

On 12/29/2010 09:19 AM, Michael Hanke wrote:

Hi,

On Tue, Dec 28, 2010 at 09:01:25PM -0500, Michael Hanke wrote:

Looks like I am facing two problems: 1. The job is not successfully
scheduled in the first place and 2. schedd crashed.


Problem 1. was my fault (I had another START expression in a config.d
file that failed when TARGET.RequestCpus was undefined). Problem 2.
remains unsolved (but doesn't happen anymore with the changes
described below).
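
The START expression in question is not shown in this thread, so the following is only a guess at the general shape of a fix. A comparison against an undefined attribute evaluates to UNDEFINED rather than TRUE, so one common approach is to guard the reference explicitly. The limit of 8 and the fallback of one core are assumptions chosen for illustration.

```text
# Hypothetical guard for a START expression in a config.d file:
# treat a job ad without RequestCpus as a single-core request.
START = $(START) && ( ifThenElse(isUndefined(TARGET.RequestCpus), 1, TARGET.RequestCpus) <= 8 )
```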

So, although initially matched, the job gets rejected in the end.  I
cannot figure out what 'machine requirements' aren't satisfied.
Submitting the same job as vanilla works like a charm. I suspect it has
something to do with the slot configuration (see below).


Looking further, that doesn't seem to be entirely true. Indeed, if I
remove 'SLOT_TYPE_1_PARTITIONABLE = TRUE' from the node configuration,
my parallel jobs get scheduled and run fine. However, even with
partitionable slots they run IF the submit file doesn't specify
'RequestMemory' or 'RequestCpus'.
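
The node configuration itself is not quoted in this message. A typical single partitionable slot setup for a dedicated node would look something like the sketch below. Only the SLOT_TYPE_1_PARTITIONABLE line is taken from the text above; the remaining lines and the scheduler host name are assumptions.

```text
# One slot that owns all of the machine's resources and can be carved up.
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = 100%
SLOT_TYPE_1_PARTITIONABLE = TRUE

# Parallel universe jobs need the node to advertise a dedicated scheduler.
# submit.example.org is a placeholder host name.
DedicatedScheduler = "DedicatedScheduler@submit.example.org"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
```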

While 'RequestCpus' is probably not that useful for parallel jobs anyway,
I think 'RequestMemory' is essential for partitionable slots. What
happens with a 'RequestMemory' statement in the submit file is that the
dedicated scheduler claims a slot (a partition) but doesn't honor the
requested memory size. When Condor tries to run the job on the claimed
slot partition it fails, because the slot has insufficient memory.
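
As an illustration of the kind of request involved, a vanilla job that needs more memory than an even per-core split would provide might be submitted as sketched below. The executable name and the 16 GB figure are made up for the example; request_memory is given in megabytes.

```text
# Hypothetical single-core job asking a partitionable slot for 16 GB.
universe       = vanilla
executable     = analysis.sh
request_cpus   = 1
request_memory = 16384
queue
```

This is the case that works on partitionable slots; the failure described above occurs when the same kind of request_memory line appears in a parallel universe submit file.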

Now I start to wonder: Maybe I'm trying to force a setup that is
suboptimal, unsupported, or otherwise problematic? From reading the manual I got the
impression that partitionable slots are the most appropriate for our use
cases. We have machines with 8 CPUs and 64GB each. We need to run many
single CPU vanilla jobs, but also multi-threaded tools, as well as
MPI-like parallel jobs. The necessary amount of memory varies from
little to HUGE. All these machines are dedicated cluster nodes. Having a
single partitionable slot where resources can be allocated according to
a job's requirements looked like the most flexible.

Can anybody provide some advice on how to ensure peaceful co-existence
of vanilla and parallel jobs on the same machines -- given that it has
enough resources?

Thanks in advance,

Michael
