
Re: [Condor-users] Partitionable slots and the parallel universe in 7.8



Hi Tim,

Here's one I'm using:

##########
Universe = parallel
Executable = openmpi_wrapper.sh
arguments = mb_3.1.2-amd64-MPI mb.txt1
transfer_input_files  = mb_3.1.2-amd64-MPI, mb.txt1, shorthorses.nex
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
notification = never

Requirements = OpSys == "LINUX" && Arch == "X86_64"

machine_count = 2
request_cpus = 2

+WantParallelSchedulingGroups = True

Output = OUT
Log = LOG
Error = ERR
Queue
##########

Best regards,
Mark

On 24/10/2012 14:51, Tim St Clair wrote:
Mark -

Could you please post your complete submission file?

It appears that it's failing when trying to quantize or convert one of

request_memory
request_cpus
or request_disk

Also, I typically set

machine_count = (some hard-coded number, like 2)
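
As a sketch of what that could look like, here is a submit-file fragment
with a literal machine_count and all three requests spelled out
explicitly. The request_memory and request_disk values are illustrative
assumptions only; request_memory is in MB and request_disk is in KB.

##########
# Illustrative values only (memory and disk figures are assumptions)
machine_count = 2
request_cpus = 2
request_memory = 1024
request_disk = 1024
##########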

Cheers,
Tim

----- Original Message -----
From: "Mark Calleja" <mc321@xxxxxxxxx>
To: "Condor-Users Mail List" <condor-users@xxxxxxxxxxx>
Sent: Wednesday, October 24, 2012 5:35:39 AM
Subject: [Condor-users] Partitionable slots and the parallel universe in 7.8

Hi,

I've been trying to get MPI jobs using the parallel universe to work
with partitionable slots in 7.8.[3-5], but with no success. My tests
are currently limited to single hosts using ParallelSchedulingGroups,
and it should be noted that using static slots instead works. Has
anyone got the former set up to work?
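
For context, a single-host setup of the kind described above is usually
expressed in the execute node's configuration roughly as follows. This
is a generic sketch built from standard configuration knobs, not the
configuration actually used in these tests; the scheduler host name is
a placeholder.

##########
# One partitionable slot that owns all of the machine's resources
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=100%
SLOT_TYPE_1_PARTITIONABLE = True

# Dedicated scheduler for the parallel universe (placeholder host name)
DedicatedScheduler = "DedicatedScheduler@submit.example.org"

# One scheduling group per host, so all nodes of a job land on one machine
ParallelSchedulingGroup = "$(HOSTNAME)"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler, ParallelSchedulingGroup
##########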

My job sets both machine_count and request_cpus in the submit file,
and when I set machine_count = request_cpus > 1, then the job always
fails to run with the following message in the StartLog of the matched
execute host (the machine has 4 cores):

10/24/12 11:04:22 slot1: match_info called
10/24/12 11:04:22 slot1: Received match <172.24.116.41:57793>#1351071964#38#...
10/24/12 11:04:22 slot1: State change: match notification protocol successful
10/24/12 11:04:22 slot1: Changing state: Unclaimed -> Matched
10/24/12 11:04:22 Job no longer matches partitionable slot after MODIFY_REQUEST_EXPR_ edits, retrying w/o edits
10/24/12 11:04:22 slot1: Partitionable slot can't be split to allocate a dynamic slot large enough for the claim
10/24/12 11:04:22 slot1: State change: claiming protocol failed
10/24/12 11:04:22 slot1: Changing state: Matched -> Owner
10/24/12 11:04:22 slot1: State change: IS_OWNER is false
10/24/12 11:04:22 slot1: Changing state: Owner -> Unclaimed

On the other hand, if I set machine_count to the required number of
cores but set request_cpus to 1, then the job runs (which is not
ideal), but on finishing the slot remains in the Claimed/Idle state,
and one sees the following messages in the StartLog:

Starter pid XXXX exited with status 2
Warning: Starter pid XXXX is not associated with an claim. A slot may fail to transition to Idle.

I've tried both setting my own JOB_DEFAULT_REQUEST_[MEMORY, DISK, CPUS]
values and leaving them at the defaults, but no joy.
MUST_MODIFY_REQUEST_EXPRS is left unset, so it defaults to False. I've
tried all of this under Debian 6.0.6.

Any help on how to get this to work would be appreciated.

Best regards,
Mark

ps. I know that one can use the vanilla universe for single host MPI
with partitionable slots, and it works, but we need this for backwards
compatibility reasons.