
Re: [HTCondor-users] Debugging config setup for Parallel Universe machine pool



Hi David -

Re the below: considering you are getting the error
"Job requested parallel scheduling groups, but no groups found",
I have to wonder whether ParallelSchedulingGroup is indeed showing up in your machine (startd) classads. I see you confirmed it is in the config file via condor_config_val, but I'd suggest confirming it is truly in the machine classad by doing
  condor_status -l | grep -i ParallelSchedulingGroup
or some such. If it is not there, perhaps it is something as simple as not running "condor_reconfig -all" from your central manager after editing your config files to add the ParallelSchedulingGroup settings.
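To make that concrete, a quick check-and-reconfig cycle might look like the following (a sketch, run from the central manager; adjust to your pool):

```shell
# Check whether ParallelSchedulingGroup made it into the startd classads
# that the collector is actually serving.
condor_status -l | grep -i ParallelSchedulingGroup

# If nothing shows up, push the edited config out to the daemons and
# re-check once the startds have re-advertised themselves.
condor_reconfig -all
condor_status -l | grep -i ParallelSchedulingGroup
```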

regards,
Todd



On 3/22/2013 10:00 AM, David Hentchel wrote:
I am a new Condor user and am stalled attempting to execute a Parallel
Universe job on a setup with 2 host machines.

Our software is designed to scale to thousands of host machines, with
the aggregate behaving as if it were a single logical database running on a
huge central machine. As such, HTCondor seems like the ideal tool to
manage our multi-host testing requirements.

Condor installation and initial setup was easy. Specifics:
- version 7.6.10
- non-root install
- Manager, Scheduler and 1 Execute node on host p1
- Pooled execute node on host p2
- after initial install a simple Vanilla Universe job ran fine
- have a shared main condor_config, plus a localized condor_config.local
for each host
I made the following config file changes (note: in all these samples I'm
replacing IP addresses and full hostnames with short hostnames):
============  Main condor_config  ================
##  central pool manager
COLLECTOR_NAME          = NuoDB-DHentchel-p1
## Map Scheduler to Parallel Universe group; this defines a pool for concurrent, parallel runs
SCHEDD_NAME     = DedicatedScheduler
DedicatedScheduler      = "DedicatedScheduler@p1"
ParallelSchedulingGroup = "P5"
## Security settings:
ALLOW_WRITE = $(ALLOW_WRITE), $(CONDOR_HOST)
============  Each condor_config.local  ================
## Bind to parent scheduler group, to enable parallel universe dispatch
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
STARTD_ATTRS = $(STARTD_ATTRS), ParallelSchedulingGroup
## Tune STARTD for dedicated, parallel scheduling
START     = True
RANK      = Scheduler =?= $(DedicatedScheduler)
LOCAL_DIR = /var/local/condor/$(HOSTNAME)
(For p1 only):
DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD
ALLOW_WRITE = $(ALLOW_WRITE), p2
(For p2 only):
DAEMON_LIST = MASTER, STARTD

I validated that the critical config variables are set on both machines:
condor_config_val COLLECTOR_NAME ==> NuoDB-DHentchel-p1
condor_config_val SCHEDD_NAME ==> DedicatedScheduler
condor_config_val DedicatedScheduler ==> "DedicatedScheduler@p1"
condor_config_val ParallelSchedulingGroup ==> "P1"
condor_config_val STARTD_ATTRS ==> COLLECTOR_HOST_STRING,
DedicatedScheduler, ParallelSchedulingGroup
condor_config_val RANK ==> Scheduler =?= "DedicatedScheduler@p1"
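A sketch of cross-checking those per-host config values against what the collector is actually advertising (the -format invocation here is my suggestion of a convenient form, not something from the original setup):

```shell
# Print each slot's machine name and its advertised
# ParallelSchedulingGroup, one line per slot, straight from the
# collector. Slots missing the attribute print an empty value.
condor_status -format "%s " Machine -format "%s\n" ParallelSchedulingGroup
```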

Now I submit the following job:
universe     = parallel
Scheduler = "DedicatedScheduler@p1"
executable   = /bin/sleep
arguments    = 30
machine_count = 2
+WantParallelSchedulingGroups = True
error   = log/simple-sleep.$(PID).err
output  = log/simple-sleep.$(PID).out
log     = log/simple-sleep.$(PID).log
queue
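For completeness, the submit-and-inspect cycle looks like this (a sketch; the file name sleep.sub is illustrative, assuming the submit description above is saved under that name):

```shell
condor_submit sleep.sub   # queue the parallel universe job
condor_q                  # confirm the job shows up (idle at first)
condor_q -analyze         # ask the schedd why an idle job is not matching
```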

This sits in the queue and is never dispatched:
-- Submitter: DedicatedScheduler@p1: <p1:51203> : p1
ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   1.0   dhentchel       3/21 16:12   0+00:00:00 I  0   0.0  sleep 30
   2.0   dhentchel       3/22 09:57   0+00:00:00 I  0   0.0  sleep 30

From the log files, I see some interesting messages.
CollectorLog:
03/21/13 16:12:34 SubmittorAd  : Inserting ** "< dhentchel@xxxxxxxxx DedicatedScheduler@p1 , p1 >"
03/21/13 16:12:34 stats: Inserting new hashent for 'Submittor':'dhentchel@xxxxxxxxx':'p1'
MatchLog:
03/22/13 09:57:46       Matched 2.0 dhentchel@xxxxxxxxx <p1:51203> preempting none <p2:33543> slot1@p2
03/22/13 09:58:46       Matched 2.0 dhentchel@xxxxxxxxx <p1:51203> preempting none <p2:33543> slot2@p2
etc.; all expected slots on both hosts show up.
SchedLog:
03/21/13 16:09:08 (pid:10111) ** SubsystemInfo: name=SCHEDD
type=SCHEDD(5) class=DAEMON(1)
03/21/13 16:12:34 (pid:10111) TransferQueueManager stats: active up=0/10
down=0/10; waiting up=0 down=0; wait time up=0s down=0s
03/21/13 16:12:34 (pid:10111) Sent ad to central manager for dhentchel@xxxxxxxxx
03/21/13 16:12:34 (pid:10111) Sent ad to 1 collectors for dhentchel@xxxxxxxxx
03/21/13 16:12:34 (pid:10111) Inserting new attribute Scheduler into
non-active cluster cid=1 acid=-1
03/21/13 16:12:34 (pid:10111) Trying to satisfy job with group scheduling
03/21/13 16:12:34 (pid:10111) Job requested parallel scheduling groups,
but no groups found

So all condor processes seem to be communicating correctly, but the
submit request with "universe = parallel" is failing with the message
"Job requested parallel scheduling groups, but no groups found".

I'm hoping that someone can identify what is missing in the
configuration, or perhaps give me advice on how to dig deeper to find
where the machine pool setup is going astray.





--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685