Re: [Condor-users] Partitionable slots and the parallel universe in 7.8
- Date: Wed, 24 Oct 2012 09:51:46 -0400 (EDT)
- From: Tim St Clair <tstclair@xxxxxxxxxx>
- Subject: Re: [Condor-users] Partitionable slots and the parallel universe in 7.8
Could you please post your complete submit file?

It appears that it's failing when trying to quantize/convert one of the resource requests.

Also, I typically put:

machine_count = (some hard number, like 2)
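
For example, a minimal parallel-universe submit file along the lines Tim suggests might look like this (the executable path and wrapper-script name are placeholders, not from the original mail):

```
universe      = parallel
executable    = /path/to/mpi_wrapper    # hypothetical wrapper that invokes mpirun
arguments     = my_mpi_program
machine_count = 2                       # hard-coded, per the suggestion above
request_cpus  = 1                       # CPUs requested per slot
log           = mpi.log
output        = mpi.$(NODE).out
error         = mpi.$(NODE).err
queue
```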
----- Original Message -----
> From: "Mark Calleja" <mc321@xxxxxxxxx>
> To: "Condor-Users Mail List" <condor-users@xxxxxxxxxxx>
> Sent: Wednesday, October 24, 2012 5:35:39 AM
> Subject: [Condor-users] Partitionable slots and the parallel universe in 7.8
> I've been trying to get MPI jobs using the parallel universe to work
> with partitionable slots in 7.8.[3-5], but with no success. My tests
> are currently limited to single hosts using ParallelSchedulingGroups,
> and it should be noted that using static slots instead works. Has
> anyone got the former set up to work?
>
> My job sets both machine_count and request_cpus in the submit file.
> When I set machine_count = request_cpus > 1, the job always fails
> to run, with the following message in the StartLog of the matched
> host (the machine has 4 cores):
> 10/24/12 11:04:22 slot1: match_info called
> 10/24/12 11:04:22 slot1: Received match
> 10/24/12 11:04:22 slot1: State change: match notification protocol
> 10/24/12 11:04:22 slot1: Changing state: Unclaimed -> Matched
> 10/24/12 11:04:22 Job no longer matches partitionable slot after
> MODIFY_REQUEST_EXPR_ edits, retrying w/o edits
> 10/24/12 11:04:22 slot1: Partitionable slot can't be split to
> allocate a
> dynamic slot large enough for the claim
> 10/24/12 11:04:22 slot1: State change: claiming protocol failed
> 10/24/12 11:04:22 slot1: Changing state: Matched -> Owner
> 10/24/12 11:04:22 slot1: State change: IS_OWNER is false
> 10/24/12 11:04:22 slot1: Changing state: Owner -> Unclaimed
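
For reference, the MODIFY_REQUEST_EXPR_* edits mentioned in that log are the startd's request-rounding expressions. Their stock defaults in HTCondor of this vintage are roughly as follows (shown here from memory, so treat the exact values as an assumption):

```
MODIFY_REQUEST_EXPR_REQUESTCPUS   = quantize(RequestCpus, {1})
MODIFY_REQUEST_EXPR_REQUESTMEMORY = quantize(RequestMemory, {128})
MODIFY_REQUEST_EXPR_REQUESTDISK   = quantize(RequestDisk, {1024})
```

If the quantized request no longer fits within the partitionable slot's remaining resources, the match is rejected, as in the log above.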
> On the other hand, if I set machine_count to the required number of
> cores but set request_cpus to 1, then the job runs (which is not
> ideal), but on finishing the slot remains in the Claimed/Idle state,
> and one sees the following messages in the StartLog:
>
> Starter pid XXXX exited with status 2
> Warning: Starter pid XXXX is not associated with a claim. A slot may
> fail to transition to Idle.
> I've tried both setting my own JOB_DEFAULT_REQUESTMEMORY and
> JOB_DEFAULT_REQUESTCPUS values and leaving them at the defaults, but
> no joy. MUST_MODIFY_REQUEST_EXPRS is left unset, which defaults to
> False. All of this was tried under Debian 6.0.6.
>
> Any help on how to get this to work would be appreciated.
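
For completeness, those schedd knobs set the per-job defaults that are used when a submit file omits request_cpus or request_memory; a configuration fragment would look something like this (the values shown are illustrative, not the stock defaults):

```
# Illustrative per-job request defaults (schedd configuration)
JOB_DEFAULT_REQUESTMEMORY = 1024
JOB_DEFAULT_REQUESTCPUS   = 1
```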
> Best regards,
> P.S. I know that one can use the vanilla universe for single-host MPI
> with partitionable slots, and it works, but we need the parallel
> universe for compatibility reasons.