Re: [HTCondor-users] CondorCE with Condor HA setup broke



An HA configuration for the Condor LRMS should not be an issue for the CE. I also wouldn't expect real jobs to fail when trace jobs succeeded. I assume excerpt [1] is from the CE SchedLog and [3] from the LRMS Condor configuration?

The messages in [1] don't look like a problem. I expect to see them, since the CE doesn't have a startd or negotiator. Do you see anything else in the logs that's indicative of a problem? Is the Job Router failing to contact the LRMS schedd?
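
As a rough sketch of how to look for "anything else" in the logs, the three expected messages from [1] can be filtered out so that the remaining lines stand out. The helper name filter_expected is made up for illustration and is not an HTCondor tool, and the log path in the usage comment assumes the packaged default location.

```shell
# Drop the messages that are expected on a CE without a startd or
# negotiator, so that any remaining SchedLog lines are easier to spot.
# filter_expected is a hypothetical helper name, not an HTCondor command.
filter_expected() {
    grep -v \
        -e "Can't find address for startd" \
        -e "Can't find address for negotiator" \
        -e "Failed to send RESCHEDULE to unknown daemon"
}

# Usage (path is an assumption; adjust to where your CE writes its logs):
#   filter_expected < /var/log/condor-ce/SchedLog
```

The same filter can be pointed at whichever log the Job Router writes on the CE host, to see whether it reports trouble reaching the LRMS schedd.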

 - Jaime

> On Dec 13, 2021, at 10:04 AM, Thomas Hartmann <thomas.hartmann@xxxxxxx> wrote:
> 
> Hi all,
> 
> today we moved our Condor LRMS to HA, and I ran into a problem: the
> CondorCEs had trouble with the two heads. Interestingly, I had not
> run into the issue on my test cluster, as trace jobs to the test CEs
> reached their LRMS Condor.
> On the production cluster I also did not notice the issue at
> first, as trace jobs to the production CondorCEs went through to Condor
> and started to run. However, real user jobs failed to get passed
> through [1].
> 
> For the moment I have pinned the CEs' LRMS Condor configs to a single
> non-HA CONDOR_HOST, which works with the CondorCE config [2,3].
> 
> But I am now looking for the proper setup to attach the CondorCEs to the
> HA-aware schedulers, and for the reason why the trace jobs went through
> while real jobs failed. The trace jobs should also have gone through the
> CE to reach the cluster, shouldn't they?
> 
> Cheers,
>  Thomas
> 
> 
> [1] SchedLog @ grid-htcondorce1.desy.de
> 12/13/21 16:21:07 Can't find address for startd grid-htcondorce1.desy.de
> 12/13/21 16:21:07 Can't find address for negotiator
> 12/13/21 16:21:07 Failed to send RESCHEDULE to unknown daemon:
> 12/13/21 16:21:07 Job 977401.0 released from hold: Data files spooled
> 
> 
> [2] CE sched conf
> JOB_ROUTER_SCHEDD2_SPOOL=/var/lib/condor/spool
> JOB_ROUTER_SCHEDD2_NAME=$(FULL_HOSTNAME)
> JOB_ROUTER_SCHEDD2_POOL=condor01.desy.de:9618
> 
> [3]
> # CENTRAL_MANAGER1 = condor01.desy.de
> # CENTRAL_MANAGER2 = grid-htc-master02.desy.de
> #CONDOR_HOST = condor01.desy.de,grid-htc-master02.desy.de
> CONDOR_HOST = condor01.desy.de