Re: [HTCondor-users] Debugging a DedicatedScheduler?



On Thu, Jun 19, 2014 at 10:02:07AM -0500, Greg Thain wrote:
> On 06/19/2014 05:32 AM, Steffen Grunewald wrote:
> >
> >Apparently the DS isn't running - what am I missing, and how would
> >I find out more?
> >
> 
> Currently, condor_q -analyze doesn't know about the dedicated
> scheduler.  The first thing you want to do is make sure that the
> startd's idea of the schedd's name matches the schedd's idea.  So, see
> which dedicated scheduler name the startds advertise they are
> willing to be managed by:
> 
> condor_status -af DedicatedScheduler

# condor_status -af DedicatedScheduler | uniq -c  
    ${node_count} undefined

> The output should be something like
> 
> DedicatedScheduler@my_schedd_name
> 
> Verify that the string after the (first) at sign matches
> 
> condor_status -schedd -af Name

returns the public FQDN(s) properly.
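
For reference, the comparison described above could be scripted roughly
as follows. This is only a sketch: the function name check_ds_name and
the example host names are made up for illustration, and the function
takes the two strings as arguments instead of calling condor_status
itself, so it can be tried without a pool.

```shell
#!/bin/sh
# check_ds_name ADVERTISED SCHEDD_NAME   (hypothetical helper, not part of HTCondor)
#   ADVERTISED   one line of output from: condor_status -af DedicatedScheduler
#   SCHEDD_NAME  one line of output from: condor_status -schedd -af Name
# Prints a verdict; returns 0 on match, 1 on mismatch, 2 if nothing is advertised.
check_ds_name() {
    advertised=$1
    schedd_name=$2

    # The startd does not advertise the attribute at all.
    case "$advertised" in
        ""|undefined)
            echo "not advertised"
            return 2
            ;;
    esac

    # Strip everything up to and including the first "@".
    name_part=${advertised#*@}

    if [ "$name_part" = "$schedd_name" ]; then
        echo "match"
        return 0
    fi
    echo "mismatch: startd has '$name_part', schedd is '$schedd_name'"
    return 1
}

# Possible use against a live pool (assumes a single schedd):
#   schedd=$(condor_status -schedd -af Name | head -n 1)
#   condor_status -af DedicatedScheduler | sort -u |
#       while read -r adv; do check_ds_name "$adv" "$schedd"; done
```

With the output shown above, every node would fall into the
"not advertised" case.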

I suppose the "undefined" string is not what you'd expect, and I'd have
to ssh to one of the nodes to check why:

# condor_config_val -dump | grep STARTD
ALLOW_READ_STARTD = $(ALLOW_READ), $(FLOCK_FROM)
ALLOW_WRITE_STARTD = $(ALLOW_WRITE), $(FLOCK_FROM)
COLLECTOR_REPEAT_STARTD_ADS = 0
DAEMON_LIST = MASTER, STARTD
HOSTALLOW_READ_STARTD = $(HOSTALLOW_READ), $(FLOCK_FROM)
HOSTALLOW_WRITE_STARTD = $(HOSTALLOW_WRITE), $(FLOCK_FROM)
MAX_STARTD_LOG = 10000000
NEGOTIATOR_INFORM_STARTD = true
NEGOTIATOR_USE_NONBLOCKING_STARTD_CONTACT = true
SCHEDD_USES_STARTD_FOR_LOCAL_UNIVERSE = True
SETTABLE_ATTRS_ADVERTISE_STARTD = 
STARTD = $(SBIN)/condor_startd
STARTD_AD_REEVAL_EXPR = 
STARTD_ADDRESS_FILE = $(RUN)/StartdAddress
STARTD_ATTRS = COLLECTOR_HOST_STRING, DedicatedScheduler
STARTD_CLAIM_ID_FILE = $(RUN)/StartdClaimId
STARTD_COMPUTE_AVAIL_STATS = false
STARTD_CONTACT_TIMEOUT = 45
STARTD_CRON_AUTOPUBLISH = If_Changed
STARTD_CRON_NAME = 
STARTD_DEBUG = D_COMMAND
STARTD_FACTORY_SCRIPT_AVAILABLE_PARTITIONS = 
STARTD_FACTORY_SCRIPT_BACK_PARTITION = 
STARTD_FACTORY_SCRIPT_BOOT_PARTITION = 
STARTD_FACTORY_SCRIPT_DESTROY_PARTITION = 
STARTD_FACTORY_SCRIPT_GENERATE_PARTITION = 
STARTD_FACTORY_SCRIPT_QUERY_WORK_LOADS = 
STARTD_FACTORY_SCRIPT_SHUTDOWN_PARTITION = 
STARTD_HAS_BAD_UTMP = 0
STARTD_HISTORY = $(LOG)/StartdHistory
STARTD_JOB_EXPRS = ImageSize, ExecutableSize, JobUniverse, NiceUser
STARTD_JOB_HOOK_KEYWORD = 
STARTD_LOG = $(LOG)/StartdLog
STARTD_MAX_AVAIL_PERIOD_SAMPLES = 100
STARTD_NAME = 
STARTD_NOCLAIM_SHUTDOWN = 0
STARTD_RESOURCE_PREFIX = 
STARTD_SENDS_ALIVES = peer
STARTD_SHOULD_WRITE_CLAIM_ID_FILE = true
STARTD_SLOT_ATTRS = State, Activity, EnteredCurrentActivity
STARTD_SLOT_EXPRS = 
STARTD_VM_ATTRS = 
STARTD_VM_EXPRS = 

# condor_config_val -dump | grep Scheduler
DedicatedScheduler = $(DEDICATED_SCHEDULER)
IsScheduler = (TARGET.JobUniverse == $(SCHEDULER_U))
START_SCHEDULER_UNIVERSE = TotalSchedulerJobsRunning < 10
STARTD_ATTRS = COLLECTOR_HOST_STRING, DedicatedScheduler

# condor_config_val -dump | grep DEDICATED
DEDICATED_SCHEDULER = $(MASTER_MACHINE).(...)
...

and that one *is* properly defined, since MASTER_MACHINE is set
(it's the one running the collector and negotiator)


I'm pretty sure I followed the config docs page by page, but I must've missed
something important along the way :(
What's worse: I can't see what and where :(:(

Thanks,
 Steffen


-- 
Steffen Grunewald * Cluster Admin * steffen.grunewald(*)aei.mpg.de
MPI f. Gravitationsphysik (AEI) * Am Mühlenberg 1, D-14476 Potsdam
http://www.aei.mpg.de/ * ------- * +49-331-567-{fon:7274,fax:7298}