
Re: [HTCondor-users] CondorCE with Condor HA setup broke



Hi Jaime,

I dug a bit into the issue, and the message about a failed daemon
RESCHEDULE command [1] really seems to be a red herring - at least it
appears in fulldebug in all cases.

---

The issue is somewhat difficult to reproduce as trace jobs are
apparently not affected at all - and debugging on the production
machines is somewhat awkward.


On one of our production CEs, I have now added both LRMS Condor heads as
a comma-separated list to the CE's JOB_ROUTER_SCHEDD2_POOL setting, e.g.,
  JOB_ROUTER_SCHEDD2_POOL=mainhead.fqdn.fo:9618,fallbackhead.fqdn.fo:9618

So far it seems to work and production jobs are propagating to the LRMS
Condor. The CE schedd also sends updates to both manager nodes [2].

But I am not sure whether this is the proper way to attach the CE to the
pool's manager(s). E.g., I see messages about failing job removals(?) in
the router log [3], which does not look healthy. (Maybe the manager list
is taken as a single name here?)
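
To tell the different removal failures in [3] apart, a small script along
these lines might help. It is only a sketch: the function names are made
up for illustration, and it assumes that every log record starts with a
timestamp in the format seen in the excerpts and that continuation lines
(as wrapped in this mail) belong to the preceding record.

```python
import re
from collections import Counter

# Assumption: a record starts with a timestamp such as "12/15/21 15:42:26".
TIMESTAMP = re.compile(r"^\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\b")


def join_wrapped(lines):
    """Re-join log records that were wrapped across several lines."""
    records = []
    for line in lines:
        line = line.strip()
        if not line or line == "...":
            continue  # skip blank lines and elision markers
        if TIMESTAMP.match(line) or not records:
            records.append(line)
        else:
            records[-1] += " " + line
    return records


def removal_failures(records):
    """Count 'failed to remove dest job' records by the reason given."""
    reasons = Counter()
    for rec in records:
        _, sep, reason = rec.partition("failed to remove dest job: ")
        if sep:
            # Mask job ids so that identical causes are grouped together.
            reasons[re.sub(r"\d+\.\d+", "<job>", reason)] += 1
    return reasons
```

Fed with the lines of the router log, removal_failures(join_wrapped(lines))
gives one count for the "Unable to find address of ..." failures and a
separate one for the "Job <job> not found" failures, which shows whether
the address lookup problem is still occurring after the configuration
change.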

Cheers,
  Thomas

(the active negotiator is running on our condor01 node and collectors
are running on both, condor01 & grid-htc-master02)

[1]
12/15/21 14:58:08 Sending RESCHEDULE command to negotiator(s)
12/15/21 14:58:08 Will use TCP to update collector
grid-htcondorce-dev.desy.de
<131.169.223.131:9619?alias=grid-htcondorce-dev.desy.de>
12/15/21 14:58:08 Trying to query collector
<131.169.223.131:9619?alias=grid-htcondorce-dev.desy.de>
12/15/21 14:58:08 Can't find address for negotiator
12/15/21 14:58:08 Failed to send RESCHEDULE to unknown daemon:
12/15/21 14:58:08 ForkWorker::Fork: New child of 3252024 = 3252206
12/15/21 14:58:08 Number of Active Workers 0


[2] JobRouterLog
12/15/21 15:11:23 HOOK_JOB_FINALIZE not configured.
12/15/21 15:11:23 Will use TCP to update collector condor01.desy.de
<131.169.56.33:9618?alias=condor01.desy.de>
12/15/21 15:11:23 Will use TCP to update collector
grid-htc-master02.desy.de
<131.169.223.100:9618?alias=grid-htc-master02.desy.de>
12/15/21 15:11:23 Trying to query collector
<131.169.223.100:9618?alias=grid-htc-master02.desy.de>
12/15/21 15:11:23 SharedPortClient: sent connection request to schedd at
<131.169.223.131:9620> for shared port id schedd_3253687_3afb
12/15/21 15:11:23 (6.0) Writing terminate record to user logfile
...

[3] JobRouterLog
12/15/21 15:42:26 Unable to find address of grid-htcondorce1.desy.de at
condor01.desy.de:9618,grid-htc-master02.desy.de:9618
12/15/21 15:42:26 JobRouter
(src=996335.5,dest=1739316.0,route=Local_Condor): failed to remove dest
job: Unable to find address of grid-htcondorce1.desy.de at
condor01.desy.de:9618,grid-htc-master02.desy.de:9618
12/15/21 15:42:26 JobRouter failure
(src=993489.0,dest=1737377.0,route=DESYGRID): giving up, because
submitted job is still not in job queue mirror (submitted 614 seconds
ago).  Perhaps it has been removed?
12/15/21 15:42:26 Can't find address for schedd grid-htcondorce1.desy.de
12/15/21 15:42:26 Unable to find address of grid-htcondorce1.desy.de at
condor01.desy.de:9618,grid-htc-master02.desy.de:9618
12/15/21 15:42:26 JobRouter
(src=993489.0,dest=1737377.0,route=DESYGRID): failed to remove dest job:
Unable to find address of grid-htcondorce1.desy.de at
condor01.desy.de:9618,grid-htc-master02.desy.de:9618
12/15/21 15:42:26 JobRouter
(src=992300.0,dest=1739178.0,route=Local_Condor): dest job was removed!
12/15/21 15:42:26 Can't find address for schedd grid-htcondorce1.desy.de
12/15/21 15:42:26 Unable to find address of grid-htcondorce1.desy.de at
condor01.desy.de:9618,grid-htc-master02.desy.de:9618
12/15/21 15:42:26 JobRouter
(src=992300.0,dest=1739178.0,route=Local_Condor): failed to remove dest
job: Unable to find address of grid-htcondorce1.desy.de at
condor01.desy.de:9618,grid-htc-master02.desy.de:9618
12/15/21 15:42:26 JobRouter
(src=991185.0,dest=1739240.0,route=Local_Condor): dest job was removed!
12/15/21 15:42:26 Can't find address for schedd grid-htcondorce1.desy.de
12/15/21 15:42:26 Unable to find address of grid-htcondorce1.desy.de at
condor01.desy.de:9618,grid-htc-master02.desy.de:9618
12/15/21 15:42:26 JobRouter
(src=991185.0,dest=1739240.0,route=Local_Condor): failed to remove dest
job: Unable to find address of grid-htcondorce1.desy.de at
condor01.desy.de:9618,grid-htc-master02.desy.de:9618
12/15/21 15:42:26 JobRouter
(src=997105.0,dest=1739180.0,route=Local_Condor): dest job was removed!
12/15/21 15:42:26 Can't find address for schedd grid-htcondorce1.desy.de
12/15/21 15:42:26 Unable to find address of grid-htcondorce1.desy.de at
condor01.desy.de:9618,grid-htc-master02.desy.de:9618
12/15/21 15:42:26 JobRouter
(src=997105.0,dest=1739180.0,route=Local_Condor): failed to remove dest
job: Unable to find address of grid-htcondorce1.desy.de at
condor01.desy.de:9618,grid-htc-master02.desy.de:9618
12/15/21 15:42:26 JobRouter
(src=993563.0,dest=1739181.0,route=Local_Condor): dest job was removed!
12/15/21 15:42:26 DCSchedd:actOnJobs: Action failed
...
12/15/21 15:42:41 Routing jobs to schedd grid-htcondorce1.desy.de in
pool condor01.desy.de:9618,grid-htc-master02.desy.de:9618
...

12/15/21 15:53:35 JobRouter
(src=995193.1,dest=1739584.0,route=DESYGRID): failed to remove dest job:
Job 1739584.0 not found
12/15/21 15:53:35 JobRouter failure
(src=995193.4,dest=1739585.0,route=DESYGRID): giving up, because
submitted job is still not in job queue mirror (submitted 606 seconds
ago).  Perhaps it has been removed?
12/15/21 15:53:35 DCSchedd:actOnJobs: Action failed
12/15/21 15:53:35 JobRouter
(src=995193.4,dest=1739585.0,route=DESYGRID): failed to remove dest job:
Job 1739585.0 not found
12/15/21 15:53:35 JobRouter
(src=993667.0,dest=1741438.0,route=Local_Condor): dest job was removed!
12/15/21 15:53:35 DCSchedd:actOnJobs: Action failed




On 14/12/2021 04.12, Jaime Frey wrote:
> An HA configuration for the Condor LRMS should not be an issue for the CE. I also wouldn't expect real jobs to fail when trace jobs succeeded. I assume excerpt [1] is from the CE SchedLog and [3] from the LRMS Condor configuration?
> 
> The messages in [1] don't look like a problem. I expect to see them, since the CE doesn't have a startd or negotiator. Do you see anything else in the logs that's indicative of a problem? Is the Job Router failing to contact the LRMS schedd?
> 
>  - Jaime
