
Re: [Condor-users] trying to get parallel-universe jobs working



It looks like this was tickling a bug in 7.2.1.  I've upgraded to 7.4.2
and the problem appears to have gone away.
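
For anyone who hits the same symptom (condor_q analysis reports machines as available, yet the parallel job stays idle), here is a rough Python sketch that pulls the counts out of the "Run analysis summary" text quoted further down. It is only an illustration: the SAMPLE text is copied from this thread, the function name parse_summary is made up for the example, and the summary layout is assumed to match what 7.2.1 printed here. Other versions may format it differently.

```python
import re

# Sample text copied from the condor_q analysis quoted in this thread.
SAMPLE = """\
007.000:  Run analysis summary.  Of 30 machines,
      4 are rejected by your job's requirements
      2 reject your job because of their own requirements
      0 match but are serving users with a better priority in the pool
      0 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
     24 are available to run your job
"""


def parse_summary(text):
    """Return (total_machines, {description: count}) for one analysis summary.

    Hypothetical helper: assumes the layout shown in SAMPLE above.
    """
    total = None
    counts = {}
    for line in text.splitlines():
        # The header line carries the pool size: "Of 30 machines,"
        header = re.search(r"Of (\d+) machines", line)
        if header:
            total = int(header.group(1))
            continue
        # Each following row is "<count> <description>"
        row = re.match(r"\s*(\d+)\s+(.+?)\s*$", line)
        if row:
            counts[row.group(2)] = int(row.group(1))
    return total, counts


total, counts = parse_summary(SAMPLE)
available = counts.get("are available to run your job", 0)
print(f"{available} of {total} machines report as available")
```

If the available count is nonzero and the job still never leaves the idle state, the job's Requirements are probably not what is blocking it, which is consistent with what I saw before the upgrade.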

nomad

Lee Damon wrote:
> One of the users here has decided he wants to run MPI jobs.  In trying
> to set up the parallel universe I can't even get the simple "sleep 30"
> job from
> <http://www.cs.wisc.edu/condor/manual/v7.2/2_9Parallel_Applications.html>
> to launch, it just sits idle.
> 
> I've followed the instructions in
> <http://www.cs.wisc.edu/condor/manual/v7.2/3_13Setting_Up.html#sec:Config-Dedicated-Jobs>
> and have the following configuration values set on the 8 test hosts.
> Vanilla universe jobs submitted on these hosts run just fine.  Parallel
> universe jobs just sit idle.
> 
>  -----
> : || nomad@flock03 ~ [77] ; condor_config_val STARTD_ATTRS
> RESOURCE_GROUP, JOB_GROUP, [...], DedicatedScheduler
> : || nomad@flock03 ~ [78] ; condor_config_val DedicatedScheduler
> "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxx"
> : || nomad@flock03 ~ [79] ; condor_config_val SUSPEND
> Scheduler =!= "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxx" && (False)
> : || nomad@flock03 ~ [80] ; condor_config_val PREEMPT
> Scheduler =!= "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxx" && ((
> ((Activity == "Suspended") && ((CurrentTime - EnteredCurrentActivity) >
> 10 * 60)) || (SUSPEND && (WANT_SUSPEND == False)) ))
> : || nomad@flock03 ~ [81] ; condor_config_val RANK_FACTOR
> 1000000
> : || nomad@flock03 ~ [82] ; condor_config_val RANK
> (Scheduler =?= "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxx" * 1000000)
> + ( MY.RESOURCE_GROUP == TARGET.JOB_GROUP || MY.RESOURCE_GROUP ==
> TARGET.USER_GROUP )
> : || nomad@flock03 ~ [83] ; condor_config_val START
> (Scheduler =?= "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxx") || (True
> && ( MY.RESOURCE_GROUP == TARGET.JOB_GROUP || MY.RESOURCE_GROUP ==
> TARGET.USER_GROUP || MY.RESOURCE_GROUP == "ssli" ) && ( State !=
> "Claimed" || (CurrentTime - EnteredCurrentState) < 10 * 60 ) &&
> ((VirtualMachineID == 1) && ((2026 - Target.JobMaxMem -
> ifThenElse(isUndefined(slot2_JobMaxMem), 0, slot2_JobMaxMem)) > 0)) ||
> ((VirtualMachineID == 2) && ((2026 -
> ifThenElse(isUndefined(slot1_JobMaxMem), 0, slot1_JobMaxMem) -
> Target.JobMaxMem) > 0)))
>  -----
> 
> The submit file I'm using is:
> 
>  -----
> : || nomad@flock03 ~/condor [85] ; cat paralleltest
> #############################################
> ##   submit description file for a parallel program
> #############################################
> universe = parallel
> executable = /bin/sleep
> arguments = 30
> machine_count = 2
> log             = /homes/nomad/condor/log
> 
> queue
>  -----
> 
> The log file shows the job being submitted but nothing further.
> condor_q -better says there are 24 slots available to run this job:
> 
>  -----
> : || nomad@flock03 ~/condor [88] ; condor_q -better 7
> 
> 
> -- Submitter: flock03.ee.washington.edu : <128.208.232.223:33650> :
> flock03.ee.washington.edu
> ---
> 007.000:  Run analysis summary.  Of 30 machines,
>       4 are rejected by your job's requirements
>       2 reject your job because of their own requirements
>       0 match but are serving users with a better priority in the pool
>       0 match but reject the job for unknown reasons
>       0 match but will not currently preempt their existing job
>      24 are available to run your job
> 
> The Requirements expression for your job is:
> 
> ( ( ( MY.RESOURCE_GROUP is TARGET.JOB_GROUP ) || ( TARGET.JOB_GROUP is
> undefined ) ) ) &&
> ( target.Arch == "INTEL" ) && ( target.OpSys == "LINUX" ) &&
> ( target.Disk >= DiskUsage ) && ( ( target.Memory * 1024 ) >= ImageSize ) &&
> ( TARGET.FileSystemDomain == MY.FileSystemDomain )
> 
>     Condition                         Machines Matched    Suggestion
>     ---------                         ----------------    ----------
> 1   ( target.Arch == "INTEL" )        26
> 2   ( ( ( "ssli" is TARGET.JOB_GROUP ) || ( TARGET.JOB_GROUP is
> undefined ) ) )
>                                       30
> 3   ( target.OpSys == "LINUX" )       30
> 4   ( target.Disk >= 17 )             30
> 5   ( ( 1024 * target.Memory ) >= 17 )30
> 6   ( TARGET.FileSystemDomain == "ee.washington.edu" )
>                                       30
>  -----
> 
> And condor_status -submitter shows the jobs being inserted by the
> DedicatedScheduler:
> 
>  -----
> : || nomad@flock03 ~/condor [99] ; condor_status -submitter
> 
> Name                 Machine      Running IdleJobs HeldJobs
> 
> DedicatedScheduler@f flock03.ee         0        4        0
> asubram@xxxxxxxxxxxx flock03.ee         0        0        0
> nomad@xxxxxxxxxxxxxx flock03.ee         0        0        0
> 
>                            RunningJobs           IdleJobs           HeldJobs
> 
> DedicatedScheduler@f                 0                  4                  0
> asubram@xxxxxxxxxxxx                 0                  0                  0
> nomad@xxxxxxxxxxxxxx                 0                  0                  0
> 
>                Total                 0                  4                  0
>  -----
> 
> 
> 
> Any hints on where I should look to see why the job isn't running?
> 
> thanks,
> nomad