Re: [HTCondor-users] Debugging config setup for Parallel Universe machine pool
- Date: Fri, 22 Mar 2013 12:16:29 -0500
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Debugging config setup for Parallel Universe machine pool
Hi David -
Re the below, considering you are getting the error
"Job requested parallel scheduling groups, but no groups found"
I have to wonder whether ParallelSchedulingGroup is indeed showing
up in your machine (startd) classads. I see you confirmed it is in the
config file via condor_config_val, but I'd suggest confirming it is
truly in the machine classad by doing
condor_status -l | grep -i ParallelSchedulingGroup
or some such. If it is not there, perhaps it could be something as
simple as not doing something like "condor_reconfig -all" from your
central manager after editing your config files to add in the
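The check suggested above can be sketched as a small shell helper. This is only an illustration, not part of HTCondor: the function name report_psg is hypothetical, and it assumes that "condor_status -l" prints one attribute per line in the form Attr = value, with a blank line between machine ads.

```shell
#!/bin/sh
# report_psg (hypothetical helper): reads long-form machine ClassAds on stdin
# and prints one line per ad: the slot Name followed by its
# ParallelSchedulingGroup value, or "<missing>" if the attribute is absent.
# Assumes ads are separated by blank lines and attribute lines look like
#   Name = "slot1@p2"
report_psg() {
    awk '
        # Print the summary for the ad collected so far, then reset state.
        function emit() {
            if (seen) printf "%s %s\n", name, (group == "" ? "<missing>" : group)
            name = ""; group = ""; seen = 0
        }
        # A blank line ends the current ad.
        /^[ \t]*$/ { emit(); next }
        { seen = 1 }
        # ClassAd attribute names are case-insensitive, so compare in lowercase.
        tolower($1) == "name" && $2 == "=" { name = $3; gsub(/"/, "", name) }
        tolower($1) == "parallelschedulinggroup" && $2 == "=" { group = $3; gsub(/"/, "", group) }
        END { emit() }
    '
}
```

It would be used as "condor_status -l | report_psg". If any slot shows <missing>, run "condor_reconfig -all" on the central manager and check again.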
On 3/22/2013 10:00 AM, David Hentchel wrote:
I am a new Condor user and am stalled attempting to execute a Parallel
Universe job on a setup with 2 host machines.
Our software is designed to scale to thousands of host machines, with
the aggregate behaving as if it were a single logical database running on a
huge central machine. As such, HTCondor seems like the ideal tool to
manage our multi-host testing requirements.
Condor installation and initial setup was easy. Specifics:
- version 7.6.10
- non-root install
- Manager, Scheduler and 1 Execute node on host p1
- Pooled execute node on host p2
- after initial install a simple Vanilla Universe job ran fine
- have a shared main condor_config, plus a localized condor_config.local
for each host
I made the following config file changes (note, in all these samples I'm
replacing IP addresses and full hostnames with short hostname):
============ Main condor_config ================
## central pool manager
COLLECTOR_NAME = NuoDB-DHentchel-p1
## Map Scheduler to Parallel Universe group; this defines a pool for
## concurrent, parallel runs
SCHEDD_NAME = DedicatedScheduler
DedicatedScheduler = "DedicatedScheduler@p1"
ParallelSchedulingGroup = "P5"
## Security settings:
ALLOW_WRITE = $(ALLOW_WRITE), $(CONDOR_HOST)
============ Each condor_config.local ================
## Bind to parent scheduler group, to enable parallel universe dispatch
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
STARTD_ATTRS = $(STARTD_ATTRS), ParallelSchedulingGroup
## Tune STARTD for dedicated, parallel scheduling
START = True
RANK = Scheduler =?= $(DedicatedScheduler)
LOCAL_DIR = /var/local/condor/$(HOSTNAME)
(For p1 only):
DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD
ALLOW_WRITE = $(ALLOW_WRITE), p2
(for p2 only):
DAEMON_LIST = MASTER, STARTD
I validated that the critical config variables are set on both machines:
condor_config_val COLLECTOR_NAME ==> NuoDB-DHentchel-p1
condor_config_val SCHEDD_NAME ==> DedicatedScheduler
condor_config_val DedicatedScheduler ==> "DedicatedScheduler@p1"
condor_config_val ParallelSchedulingGroup ==> "P1"
condor_config_val STARTD_ATTRS ==> COLLECTOR_HOST_STRING,
condor_config_val RANK ==> Scheduler =?= "DedicatedScheduler@p1"
Now I submit the following job:
universe = parallel
Scheduler = "DedicatedScheduler@p1"
executable = /bin/sleep
arguments = 30
machine_count = 2
+WantParallelSchedulingGroups = True
error = log/simple-sleep.$(PID).err
output = log/simple-sleep.$(PID).out
log = log/simple-sleep.$(PID).log
This sits on the queue and is never dispatched.
-- Submitter: DedicatedScheduler@p1: <p1:51203> : p1
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
1.0 dhentchel 3/21 16:12 0+00:00:00 I 0 0.0 sleep 30
2.0 dhentchel 3/22 09:57 0+00:00:00 I 0 0.0 sleep 30
From the log files, I see some interesting messages.
03/21/13 16:12:34 SubmittorAd : Inserting ** "< dhentchel@xxxxxxxxx
DedicatedScheduler@p1 , p1 >"
03/21/13 16:12:34 stats: Inserting new hashent for
03/22/13 09:57:46 Matched 2.0 dhentchel@xxxxxxxxx <p1:51203> preempting none <p2:33543> slot1@p2
03/22/13 09:58:46 Matched 2.0 dhentchel@xxxxxxxxx <p1:51203> preempting none <p2:33543> slot2@p2
etc., etc.; all expected slots on both hosts show up
03/21/13 16:09:08 (pid:10111) ** SubsystemInfo: name=SCHEDD
03/21/13 16:12:34 (pid:10111) TransferQueueManager stats: active up=0/10
down=0/10; waiting up=0 down=0; wait time up=0s down=0s
03/21/13 16:12:34 (pid:10111) Sent ad to central manager for
03/21/13 16:12:34 (pid:10111) Sent ad to 1 collectors for
03/21/13 16:12:34 (pid:10111) Inserting new attribute Scheduler into
non-active cluster cid=1 acid=-1
03/21/13 16:12:34 (pid:10111) Trying to satisfy job with group scheduling
03/21/13 16:12:34 (pid:10111) Job requested parallel scheduling groups,
but no groups found
So all condor processes seem to be communicating correctly, but the
submit request with "universe = parallel"
is failing with message "Job requested parallel scheduling groups, but
no groups found"
I'm hoping that someone can identify what is missing in the
configuration, or perhaps give me advice on how to dig deeper to find
where the machine pool setup is going astray.
Todd Tannenbaum <tannenba@xxxxxxxxxxx>  University of Wisconsin-Madison
Center for High Throughput Computing    Department of Computer Sciences
HTCondor Technical Lead                 1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                   Madison, WI 53706-1685