[Condor-users] Partitionable slots and the parallel universe in 7.8
- Date: Wed, 24 Oct 2012 11:35:39 +0100
- From: Mark Calleja <mc321@xxxxxxxxx>
- Subject: [Condor-users] Partitionable slots and the parallel universe in 7.8
I've been trying to get MPI jobs using the parallel universe to work
with partitionable slots in 7.8.[3-5], but with no success. My tests are
currently limited to single hosts using ParallelSchedulingGroups, and it
should be noted that the same jobs work when static slots are used
instead. Has anyone got the partitionable slot setup to work?
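For reference, an execute host set up along these lines would have a
configuration roughly like the sketch below. This is an illustration
only, not the configuration actually in use: the scheduler host name and
the group name are made-up placeholders, and the DedicatedScheduler
lines are included because the parallel universe requires them, not
because they were quoted above.

```
# Sketch only; host and group names are hypothetical.
# One partitionable slot covering all resources of the machine.
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=100%
SLOT_TYPE_1_PARTITIONABLE = True

# The parallel universe needs a dedicated scheduler (placeholder host).
DedicatedScheduler = "DedicatedScheduler@submit.example.org"

# Keep all nodes of a job on one host by giving each host its own group.
ParallelSchedulingGroup = "$(FULL_HOSTNAME)"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler, ParallelSchedulingGroup
```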
My job sets both machine_count and request_cpus in the submit file, and
when I set machine_count = request_cpus > 1, then the job always fails
to run with the following message in the StartLog of the matched execute
host (the machine has 4 cores):
10/24/12 11:04:22 slot1: match_info called
10/24/12 11:04:22 slot1: Received match
10/24/12 11:04:22 slot1: State change: match notification protocol
10/24/12 11:04:22 slot1: Changing state: Unclaimed -> Matched
10/24/12 11:04:22 Job no longer matches partitionable slot after MODIFY_REQUEST_EXPR_ edits, retrying w/o edits
10/24/12 11:04:22 slot1: Partitionable slot can't be split to allocate a dynamic slot large enough for the claim
10/24/12 11:04:22 slot1: State change: claiming protocol failed
10/24/12 11:04:22 slot1: Changing state: Matched -> Owner
10/24/12 11:04:22 slot1: State change: IS_OWNER is false
10/24/12 11:04:22 slot1: Changing state: Owner -> Unclaimed
On the other hand, if I set machine_count to the required number of
cores but set request_cpus to 1 then the job runs (which is not ideal)
but on finishing the slot remains in Claimed/Idle state, and one sees
the following messages in the StartLog:
Starter pid XXXX exited with status 2
Warning: Starter pid XXXX is not associated with an claim. A slot may
fail to transition to Idle.
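To make the two cases concrete, a submit description of the kind
described would look roughly like the following. It is a sketch, not the
file actually used: the executable, its arguments and the file names are
placeholders, the value 2 is just one example of machine_count =
request_cpus > 1, and +WantParallelSchedulingGroups is assumed from the
use of ParallelSchedulingGroups.

```
# Sketch only; executable, arguments and file names are hypothetical.
universe = parallel
executable = mpi_wrapper.sh
arguments = my_mpi_program

# Case 1 (claim fails): machine_count = request_cpus > 1
machine_count = 2
request_cpus = 2

# Case 2 (runs, but the slot is left Claimed/Idle afterwards):
# machine_count = 4
# request_cpus = 1

# Ask for all nodes to be placed within one scheduling group.
+WantParallelSchedulingGroups = True

output = out.$(NODE)
error = err.$(NODE)
log = job.log
queue
```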
I've tried with both setting my own JOB_DEFAULT_REQUEST[ MEMORY, DISK,
CPUS ] values and leaving them set to the defaults, but no joy.
MUST_MODIFY_REQUEST_EXPRS is left unset, which defaults to False. I've
tried all of this under Debian 6.0.6.
Any help on how to get this to work would be appreciated.
P.S. I know that one can use the vanilla universe for single-host MPI
with partitionable slots, and it works, but we need this for backwards