
Re: [Condor-users] Partitionable slots and the parallel universe in 7.8



Mark - 

Could you please post your complete submission file?  

It appears to be failing while trying to quantize or convert one of:

request_memory
request_cpus
request_disk

Also, I typically set machine_count to a hard-coded number, e.g.

machine_count = 2
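
For comparison, here is a minimal sketch of a parallel universe submit file with all three requests written out as plain numbers. The executable name, file names and resource values are hypothetical placeholders, not taken from your job; request_memory is in MB and request_disk is in KB.

```
# Hypothetical example; adjust names and values to your own job.
universe       = parallel
# Placeholder wrapper script that launches the MPI program
executable     = my_mpi_wrapper.sh
machine_count  = 2
request_cpus   = 1
request_memory = 1024
request_disk   = 10240
output         = out.$(NODE)
error          = err.$(NODE)
log            = parallel_job.log
should_transfer_files   = yes
when_to_transfer_output = on_exit
queue
```

If a file like this runs but yours does not, comparing the request_* lines should show which value the slot is unable to satisfy.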

Cheers,
Tim

----- Original Message -----
> From: "Mark Calleja" <mc321@xxxxxxxxx>
> To: "Condor-Users Mail List" <condor-users@xxxxxxxxxxx>
> Sent: Wednesday, October 24, 2012 5:35:39 AM
> Subject: [Condor-users] Partitionable slots and the parallel universe in 7.8
> 
> Hi,
> 
> I've been trying to get MPI jobs using the parallel universe to work
> with partitionable slots in 7.8.[3-5], but with no success. My tests
> are currently limited to single hosts using ParallelSchedulingGroups,
> and it should be noted that using static slots instead works. Has
> anyone got the former set up to work?
> 
> My job sets both machine_count and request_cpus in the submit file,
> and when I set machine_count = request_cpus > 1, then the job always
> fails to run with the following message in the StartLog of the
> matched execute host (the machine has 4 cores):
> 
> 10/24/12 11:04:22 slot1: match_info called
> 10/24/12 11:04:22 slot1: Received match <172.24.116.41:57793>#1351071964#38#...
> 10/24/12 11:04:22 slot1: State change: match notification protocol successful
> 10/24/12 11:04:22 slot1: Changing state: Unclaimed -> Matched
> 10/24/12 11:04:22 Job no longer matches partitionable slot after MODIFY_REQUEST_EXPR_ edits, retrying w/o edits
> 10/24/12 11:04:22 slot1: Partitionable slot can't be split to allocate a dynamic slot large enough for the claim
> 10/24/12 11:04:22 slot1: State change: claiming protocol failed
> 10/24/12 11:04:22 slot1: Changing state: Matched -> Owner
> 10/24/12 11:04:22 slot1: State change: IS_OWNER is false
> 10/24/12 11:04:22 slot1: Changing state: Owner -> Unclaimed
> 
> On the other hand, if I set machine_count to the required number of
> cores but set request_cpus to 1 then the job runs (which is not
> ideal), but on finishing the slot remains in Claimed/Idle state, and
> one sees the following messages in the StartLog:
> 
> Starter pid XXXX exited with status 2
> Warning: Starter pid XXXX is not associated with an claim. A slot may fail to transition to Idle.
> 
> I've tried both setting my own JOB_DEFAULT_REQUEST[ MEMORY, DISK,
> CPUS ] values and leaving them set to the defaults, but no joy.
> MUST_MODIFY_REQUEST_EXPRS is left unset, which defaults to False.
> I've tried all of this under Debian 6.0.6.
> 
> Any help on how to get this to work would be appreciated.
> 
> Best regards,
> Mark
> 
> PS. I know that one can use the vanilla universe for single host MPI
> with partitionable slots, and it works, but we need this for
> backwards compatibility reasons.