
[Condor-users] Partitionable slots and the parallel universe in 7.8


I've been trying to get MPI jobs in the parallel universe to work with partitionable slots in 7.8.[3-5], without success. My tests are currently limited to single hosts using ParallelSchedulingGroups, and the same jobs do work when I use static slots instead. Has anyone got the partitionable-slot setup to work?

My job sets both machine_count and request_cpus in the submit file. When both are set to the same value greater than 1 (machine_count = request_cpus > 1), the job always fails to run, with the following messages in the StartLog of the matched execute host (the machine has 4 cores):

10/24/12 11:04:22 slot1: match_info called
10/24/12 11:04:22 slot1: Received match <>#1351071964#38#...
10/24/12 11:04:22 slot1: State change: match notification protocol successful
10/24/12 11:04:22 slot1: Changing state: Unclaimed -> Matched
10/24/12 11:04:22 Job no longer matches partitionable slot after MODIFY_REQUEST_EXPR_ edits, retrying w/o edits
10/24/12 11:04:22 slot1: Partitionable slot can't be split to allocate a dynamic slot large enough for the claim
10/24/12 11:04:22 slot1: State change: claiming protocol failed
10/24/12 11:04:22 slot1: Changing state: Matched -> Owner
10/24/12 11:04:22 slot1: State change: IS_OWNER is false
10/24/12 11:04:22 slot1: Changing state: Owner -> Unclaimed

On the other hand, if I set machine_count to the required number of cores but set request_cpus to 1, the job runs (which is not ideal), but when it finishes the slot remains in the Claimed/Idle state, and the following messages appear in the StartLog:

Starter pid XXXX exited with status 2
Warning: Starter pid XXXX is not associated with an claim. A slot may fail to transition to Idle.
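
For reference, the two submit-file variants described above can be sketched roughly as follows. This is an illustrative sketch only, not the actual submit file: the executable and argument names are hypothetical placeholders, the numeric values are examples, and the +WantParallelSchedulingGroups line is assumed from the use of ParallelSchedulingGroups.

```
# Sketch only: names and values are placeholders, not the real job.
universe      = parallel
executable    = my_mpi_wrapper.sh      # hypothetical MPI wrapper script
arguments     = my_mpi_program         # hypothetical MPI binary
+WantParallelSchedulingGroups = True   # assumed, since ParallelSchedulingGroups are in use

# Variant 1: machine_count = request_cpus > 1 (claim fails, see first StartLog excerpt)
machine_count = 2
request_cpus  = 2

# Variant 2: one CPU per node (job runs, slot is left in Claimed/Idle afterwards)
# machine_count = 4
# request_cpus  = 1

queue
```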

I've tried both setting my own JOB_DEFAULT_REQUEST_[MEMORY, DISK, CPUS] values and leaving them at their defaults, but no joy. MUST_MODIFY_REQUEST_EXPRS is left unset, so it defaults to False. All of this was tried under Debian 6.0.6.
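
A minimal sketch of the kind of configuration involved is shown below, to make clear which knobs are being discussed. It is an assumption-laden illustration rather than the configuration actually in use: the scheduler host name is a placeholder, the slot definition assumes a single partitionable slot covering the whole machine, and the JOB_DEFAULT_REQUEST_CPUS value is only an example.

```
# Sketch only: values and host names are placeholders.

# One partitionable slot spanning the whole execute machine
NUM_SLOTS                 = 1
NUM_SLOTS_TYPE_1          = 1
SLOT_TYPE_1               = cpus=100%, memory=100%, disk=100%
SLOT_TYPE_1_PARTITIONABLE = True

# Parallel universe: dedicated scheduler plus one scheduling group per host
DedicatedScheduler      = "DedicatedScheduler@submit.example.org"   # placeholder host
ParallelSchedulingGroup = "$(HOSTNAME)"
STARTD_ATTRS            = $(STARTD_ATTRS), DedicatedScheduler, ParallelSchedulingGroup

# Request handling, as described in the text above
JOB_DEFAULT_REQUEST_CPUS  = 1        # example value only
MUST_MODIFY_REQUEST_EXPRS = False    # same as leaving it unset
```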

Any help with getting this to work would be appreciated.

Best regards,

PS. I know that the vanilla universe can be used for single-host MPI with partitionable slots, and that works, but we need the parallel universe for backwards compatibility reasons.