
[Condor-users] trying to get parallel-universe jobs working



One of the users here wants to run MPI jobs.  While setting up the
parallel universe, I can't even get the simple "sleep 30" job from
<http://www.cs.wisc.edu/condor/manual/v7.2/2_9Parallel_Applications.html>
to launch; it just sits idle.

I've followed the instructions in
<http://www.cs.wisc.edu/condor/manual/v7.2/3_13Setting_Up.html#sec:Config-Dedicated-Jobs>
and have the following configuration values set on the 8 test hosts.
Vanilla universe jobs submitted on these hosts run just fine.  Parallel
universe jobs just sit idle.

 -----
: || nomad@flock03 ~ [77] ; condor_config_val STARTD_ATTRS
RESOURCE_GROUP, JOB_GROUP, [...], DedicatedScheduler
: || nomad@flock03 ~ [78] ; condor_config_val DedicatedScheduler
"DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxx"
: || nomad@flock03 ~ [79] ; condor_config_val SUSPEND
Scheduler =!= "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxx" && (False)
: || nomad@flock03 ~ [80] ; condor_config_val PREEMPT
Scheduler =!= "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxx" && ((
((Activity == "Suspended") && ((CurrentTime - EnteredCurrentActivity) >
10 * 60)) || (SUSPEND && (WANT_SUSPEND == False)) ))
: || nomad@flock03 ~ [81] ; condor_config_val RANK_FACTOR
1000000
: || nomad@flock03 ~ [82] ; condor_config_val RANK
(Scheduler =?= "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxx" * 1000000)
+ ( MY.RESOURCE_GROUP == TARGET.JOB_GROUP || MY.RESOURCE_GROUP ==
TARGET.USER_GROUP )
: || nomad@flock03 ~ [83] ; condor_config_val START
(Scheduler =?= "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxx") || (True
&& ( MY.RESOURCE_GROUP == TARGET.JOB_GROUP || MY.RESOURCE_GROUP ==
TARGET.USER_GROUP || MY.RESOURCE_GROUP == "ssli" ) && ( State !=
"Claimed" || (CurrentTime - EnteredCurrentState) < 10 * 60 ) &&
((VirtualMachineID == 1) && ((2026 - Target.JobMaxMem -
ifThenElse(isUndefined(slot2_JobMaxMem), 0, slot2_JobMaxMem)) > 0)) ||
((VirtualMachineID == 2) && ((2026 -
ifThenElse(isUndefined(slot1_JobMaxMem), 0, slot1_JobMaxMem) -
Target.JobMaxMem) > 0)))
 -----
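
To make the memory part of that START expression easier to read, here is
a minimal Python sketch of what I understand the per-slot clause to
compute.  The names MEMORY_BUDGET, defined_or_zero and memory_clause are
made up for illustration and are not Condor APIs.  The sketch assumes
the job defines JobMaxMem and does not model how ClassAds propagate
UNDEFINED through the rest of the expression.

```python
from typing import Optional

# Constant copied from the START expression; name is illustrative only.
MEMORY_BUDGET = 2026


def defined_or_zero(value: Optional[int]) -> int:
    """Mirror ifThenElse(isUndefined(x), 0, x): an undefined value counts as 0."""
    return 0 if value is None else value


def memory_clause(job_max_mem: int, other_slot_job_max_mem: Optional[int]) -> bool:
    """Return True when budget - job - other slot's job is strictly positive.

    job_max_mem stands in for Target.JobMaxMem (assumed defined here);
    other_slot_job_max_mem stands in for slot1_JobMaxMem or slot2_JobMaxMem
    and may be None when the other slot advertises no value.
    """
    remaining = MEMORY_BUDGET - job_max_mem - defined_or_zero(other_slot_job_max_mem)
    return remaining > 0
```

For example, memory_clause(1000, None) passes, while
memory_clause(1000, 1026) fails because the remainder is exactly 0 and
the comparison is strict.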

The submit file I'm using is:

 -----
: || nomad@flock03 ~/condor [85] ; cat paralleltest
#############################################
##   submit description file for a parallel program
#############################################
universe = parallel
executable = /bin/sleep
arguments = 30
machine_count = 2
log             = /homes/nomad/condor/log

queue
 -----

The log file shows the job being submitted but nothing further.
condor_q -better says there are 24 slots available to run this job:

 -----
: || nomad@flock03 ~/condor [88] ; condor_q -better 7


-- Submitter: flock03.ee.washington.edu : <128.208.232.223:33650> :
flock03.ee.washington.edu
---
007.000:  Run analysis summary.  Of 30 machines,
      4 are rejected by your job's requirements
      2 reject your job because of their own requirements
      0 match but are serving users with a better priority in the pool
      0 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
     24 are available to run your job

The Requirements expression for your job is:

( ( ( MY.RESOURCE_GROUP is TARGET.JOB_GROUP ) || ( TARGET.JOB_GROUP is
undefined ) ) ) &&
( target.Arch == "INTEL" ) && ( target.OpSys == "LINUX" ) &&
( target.Disk >= DiskUsage ) && ( ( target.Memory * 1024 ) >= ImageSize ) &&
( TARGET.FileSystemDomain == MY.FileSystemDomain )

    Condition                         Machines Matched    Suggestion
    ---------                         ----------------    ----------
1   ( target.Arch == "INTEL" )        26
2   ( ( ( "ssli" is TARGET.JOB_GROUP ) || ( TARGET.JOB_GROUP is
undefined ) ) )
                                      30
3   ( target.OpSys == "LINUX" )       30
4   ( target.Disk >= 17 )             30
5   ( ( 1024 * target.Memory ) >= 17 )30
6   ( TARGET.FileSystemDomain == "ee.washington.edu" )
                                      30
 -----
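
As a sanity check on those numbers, here is a small Python sketch that
tallies the summary and compares it with the per-condition table.  The
dictionary keys and function names are my own shorthand, not anything
condor_q emits.  Reading the 4 rejections as the non-INTEL machines is
an inference from the table (Arch is the only condition matching fewer
than 30), not something the output states directly.

```python
from typing import Dict, List

# Counts copied from the "Run analysis summary" above; keys are shorthand.
SUMMARY: Dict[str, int] = {
    "rejected by job requirements": 4,
    "rejected by machine requirements": 2,
    "better-priority users": 0,
    "rejected for unknown reasons": 0,
    "will not preempt": 0,
    "available": 24,
}
TOTAL_MACHINES = 30

# "Machines Matched" column for conditions 1 through 6.
CONDITION_MATCHES: List[int] = [26, 30, 30, 30, 30, 30]


def summary_is_consistent(summary: Dict[str, int], total: int) -> bool:
    """True when the summary categories add up to the machine total."""
    return sum(summary.values()) == total


def failing_tightest_condition(total: int, matches: List[int]) -> int:
    """Machines that fail the most restrictive single condition.

    This is only a lower bound on requirement rejections in general; it is
    exact when every other condition matches all machines, as it does here.
    """
    return total - min(matches)
```

With the numbers above, the categories sum to 30 and
failing_tightest_condition(TOTAL_MACHINES, CONDITION_MATCHES) gives 4,
which lines up with the "4 are rejected by your job's requirements" line.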

And condor_status -submitter shows the jobs being inserted by the
DedicatedScheduler:

 -----
: || nomad@flock03 ~/condor [99] ; condor_status -submitter

Name                 Machine      Running IdleJobs HeldJobs

DedicatedScheduler@f flock03.ee         0        4        0
asubram@xxxxxxxxxxxx flock03.ee         0        0        0
nomad@xxxxxxxxxxxxxx flock03.ee         0        0        0

                           RunningJobs           IdleJobs           HeldJobs

DedicatedScheduler@f                 0                  4                  0
asubram@xxxxxxxxxxxx                 0                  0                  0
nomad@xxxxxxxxxxxxxx                 0                  0                  0

               Total                 0                  4                  0
 -----



Any hints on where I should look to see why the job isn't running?

thanks,
nomad