[HTCondor-users] CondorCE with Condor HA setup broke



Hi all,

today we moved our Condor LRMS to an HA setup, and I ran into a problem:
the CondorCEs could not cope with the two head nodes. Interestingly, I had
not run into the issue on my test cluster, as trace jobs to the test CEs
reached their LRMS Condor.
On the production cluster I also did not notice the issue at first, as
trace jobs to the production CondorCEs went through to Condor and started
to run. However, real user jobs failed to get passed through [1].

For the moment, I have pinned the CEs' LRMS Condor configs to a single
non-HA CONDOR_HOST, which works with the CondorCE config [2,3].

But I am now looking for the proper setup to attach the CondorCEs to the
HA-aware schedulers, and I wonder why the trace jobs went through while
real jobs failed. The trace jobs should also have gone through the CE
to reach the cluster, shouldn't they?

Cheers,
  Thomas


[1] SchedLog @ grid-htcondorce1.desy.de
12/13/21 16:21:07 Can't find address for startd grid-htcondorce1.desy.de
12/13/21 16:21:07 Can't find address for negotiator
12/13/21 16:21:07 Failed to send RESCHEDULE to unknown daemon:
12/13/21 16:21:07 Job 977401.0 released from hold: Data files spooled


[2] CE sched conf
JOB_ROUTER_SCHEDD2_SPOOL=/var/lib/condor/spool
JOB_ROUTER_SCHEDD2_NAME=$(FULL_HOSTNAME)
JOB_ROUTER_SCHEDD2_POOL=condor01.desy.de:9618

[3] LRMS Condor config on the CE
# CENTRAL_MANAGER1 = condor01.desy.de
# CENTRAL_MANAGER2 = grid-htc-master02.desy.de
#CONDOR_HOST = condor01.desy.de,grid-htc-master02.desy.de
CONDOR_HOST = condor01.desy.de
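
For comparison, a sketch of what the HA variant of [2] and [3] might look
like, reusing the two host names above. This is an untested assumption, not
a confirmed fix: the CONDOR_HOST list follows the usual pattern for a pool
with two central managers, while it is only a guess that
JOB_ROUTER_SCHEDD2_POOL accepts the same comma-separated list.

```
# Assumption: untested sketch, not a verified working configuration.

# LRMS Condor config on the CE: list both central managers.
CENTRAL_MANAGER1 = condor01.desy.de
CENTRAL_MANAGER2 = grid-htc-master02.desy.de
CONDOR_HOST = $(CENTRAL_MANAGER1),$(CENTRAL_MANAGER2)

# CondorCE config: guess that the job router pool takes the same list.
JOB_ROUTER_SCHEDD2_POOL = condor01.desy.de:9618,grid-htc-master02.desy.de:9618
```

Whether the job router resolves a pool list this way would need to be
checked against the SchedLog messages in [1] after a reconfig.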