
Re: [HTCondor-users] CE jobs staying idle after unsuccessful match



Let me make sure I understand the setup correctly for these troubled jobs. They start out as Condor-C jobs (grid universe type "condor") on the VO's EL9/Condor23 submit machine (recently upgraded from EL7/Condor9). The jobs then arrive on your CE (EL7/Condor9) and become vanilla universe jobs to be matched with EL7/Condor9 Execution Points.
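If it is useful for cross-checking, the relevant attributes can be pulled on both sides; this is just a sketch, and the CE-side job id below is a placeholder:

    # on the CE schedd: the incoming job and where the job router sent it
    condor_ce_q -af:j JobUniverse RoutedToJobId <ce_job_id>
    # on the local batch schedd: the resulting vanilla universe job
    condor_q -af:j JobUniverse Owner x509UserProxyVOName 19458678.0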

I don't see anything in the job ads that you sent that could cause issues with matchmaking or claiming.
I'm mystified by these jobs only being matched once. If the schedd fails to claim a matched startd, it should try to re-match the affected job. How long are these jobs sitting idle in the queue after the successful match?
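For example, something along these lines (a sketch; adjust the constraint to your case) should show how long they have been waiting since that one match:

    condor_q -constraint 'JobStatus == 1 && NumJobMatches == 1' \
             -af:j QDate LastMatchTime 'time() - LastMatchTime'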

You say the VO has multiple submission nodes, with one upgraded to EL9/Condor23. Is there a difference in the job ads coming from Condor9 and Condor23 nodes?
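One quick check could be to dump one stuck and one healthy job ad on the CE's local schedd and diff them (the job ids below are placeholders):

    condor_q -l 19458678.0       > stuck.ad   # job routed from the EL9/Condor23 node
    condor_q -l <healthy_job_id> > ok.ad      # comparable job from an EL7/Condor9 node
    diff <(sort stuck.ad) <(sort ok.ad)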

 - Jaime

> On Feb 9, 2024, at 4:55 AM, Thomas Hartmann <thomas.hartmann@xxxxxxx> wrote:
> 
> Hi all,
> 
> we occasionally have jobs coming in via our EL7/9.0.15 CEs from one of our supported VOs. They recently switched one of their submission nodes to EL9/Condor23, and now some jobs coming in from this submit node stay idle in our Condor cluster.
> So far only jobs from this submission node seem to be affected, but only a subset of its jobs are problematic; other jobs from the same submitter start without problems. We have not yet found an obvious difference between the jobs that start and run and the jobs that seem stuck in idle, so our first suspicion of an EL7/Condor9 vs. EL9/Condor23 issue did not hold.
> 
> Affected jobs all seem to have
>  'numjobmatches==1 && jobstatus==1'
> as their ad state, i.e., all of them were matched exactly once.
> 
> We increased logging on the CE & Condor entry points and on the central managers to `ALL_DEBUG = D_FULLDEBUG`, but so far hints on why these jobs stay idle and are not re-matched are sparse.
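> (If it helps to narrow things down, we could also scope the extra logging instead of ALL_DEBUG; a sketch with the standard knobs would be:
> 
>    # central managers: match-level detail from the negotiator
>    NEGOTIATOR_DEBUG = D_MATCH D_FULLDEBUG
>    # CE / entry points: schedd detail only
>    SCHEDD_DEBUG = D_FULLDEBUG
> )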
> 
> On the active central manager, such jobs have a matching attempt logged like [1], where the target execution point's startd (dynamic slots) seems to simply reject the job. Afterwards, there seem to be no further matching attempts.
> In the rejecting worker's logs there are no hints of the affected cluster id, so I have no good idea why the worker did not accept the job (I am a bit hesitant to increase logging on all execution points).
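> (A less invasive option would presumably be to raise logging on the one rejecting node only, e.g. on batch0558 via a local config drop-in plus a condor_reconfig, and then grep its StartLog for the affected cluster id; the file name below is just an example:
> 
>    # /etc/condor/config.d/99-debug.conf on batch0558 only
>    STARTD_DEBUG = D_FULLDEBUG
> )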
> In principle, the jobs are matchable and better-analyze looks good [2] with our execution points nominally willing to run them.
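> (For completeness, the analysis can also be restricted to the rejecting node and its slot ad inspected directly; flag names as per the condor_q/condor_status man pages, so treat this as a sketch:
> 
>    condor_q -better-analyze -machine batch0558.desy.de 19458678.0
>    condor_status -long slot1@batch0558.desy.de | grep -Ei '^(start|requirements) ='
> )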
> 
> Maybe someone has an idea why these once-matched & rejected jobs, i.e., numjobmatches==1, are not matched again?
> 
> Package versions for the CE and the worker are listed in [3a,b] (not in sync due to reasons...)
> 
> Cheers,
>  Thomas
> 
> [1]
> NegotiatorLog:02/09/24 08:52:52     Request 19458678.00000: autocluster 723303 (request count 1 of 2)
> NegotiatorLog:02/09/24 08:52:52       Matched 19458678.0 group_ATLAS.atlasprd000@xxxxxxx <131.169.223.129:9620?addrs=131.169.223.129-9620+[2001-638-700-10df--1-81]-9620&alias=grid-htcondorce0.desy.de&noUDP&sock=schedd_1587_20e3> preempting none <131.169.161.162:9620?addrs=131.169.161.162-9620+[2001-638-700-10a0--1-1a2]-9620&alias=batch0558.desy.de&noUDP&sock=startd_3590_0516> slot1@xxxxxxxxxxxxxxxxx
> NegotiatorLog:02/09/24 08:52:52     Request 19458678.00000: autocluster 723303 (request count 2 of 2)
> NegotiatorLog:02/09/24 08:52:52       Rejected 19458678.0 group_ATLAS.atlasprd000@xxxxxxx <131.169.223.129:9620?addrs=131.169.223.129-9620+[2001-638-700-10df--1-81]-9620&alias=grid-htcondorce0.desy.de&noUDP&sock=schedd_1587_20e3>: no match found
> MatchLog:02/09/24 08:52:52       Matched 19458678.0 group_ATLAS.atlasprd000@xxxxxxx <131.169.223.129:9620?addrs=131.169.223.129-9620+[2001-638-700-10df--1-81]-9620&alias=grid-htcondorce0.desy.de&noUDP&sock=schedd_1587_20e3> preempting none <131.169.161.162:9620?addrs=131.169.161.162-9620+[2001-638-700-10a0--1-1a2]-9620&alias=batch0558.desy.de&noUDP&sock=startd_3590_0516> slot1@xxxxxxxxxxxxxxxxx
> MatchLog:02/09/24 08:52:52       Rejected 19458678.0 group_ATLAS.atlasprd000@xxxxxxx <131.169.223.129:9620?addrs=131.169.223.129-9620+[2001-638-700-10df--1-81]-9620&alias=grid-htcondorce0.desy.de&noUDP&sock=schedd_1587_20e3>: no match found
> 
> [2]
> -- Schedd: grid-htcondorce0.desy.de : <131.169.223.129:4792?...
> The Requirements expression for job 19458678.000 is
> 
>    NODE_IS_HEALTHY && ifThenElse(x509UserProxyVOName is "desy",TEST_RESOURCE == true,GRID_RESOURCE == true) && (OpSysAndVer == "CentOS7") &&
>    ifThenElse((x509UserProxyVOName isnt "desy") && (x509UserProxyVOName isnt "ops") && (x509UserProxyVOName isnt "calice") &&
>      (x509UserProxyVOName isnt "belle"),(OLD_RESOURCE == false),(OLD_RESOURCE == false) || (OLD_RESOURCE == true)) && ifThenElse((x509UserProxyVOName isnt "desy") &&
>      (x509UserProxyVOName isnt "ops") && (x509UserProxyVOName isnt "belle"),(BELLECALIBRATION_RESOURCE == false),(BELLECALIBRATION_RESOURCE is false) ||
>      (BELLECALIBRATION_RESOURCE is true))
> 
> Job 19458678.000 defines the following attributes:
> 
>    x509UserProxyVOName = "atlas"
> 
> The Requirements expression for job 19458678.000 reduces to these conditions:
> 
>         Slots
> Step    Matched  Condition
> -----  --------  ---------
> [0]        9634  NODE_IS_HEALTHY
> [1]        9634  ifThenElse(x509UserProxyVOName is "desy",TEST_RESOURCE == true,GRID_RESOURCE == true)
> [3]        9634  OpSysAndVer == "CentOS7"
> [5]        9634  ifThenElse((x509UserProxyVOName isnt "desy") && (x509UserProxyVOName isnt "ops") && (x509UserProxyVOName isnt "calice") && (x509UserProxyVOName isnt "belle"),(OLD_RESOURCE == false),(OLD_RESOURCE == false) || (OLD_RESOURCE == true))
> [7]        9634  ifThenElse((x509UserProxyVOName isnt "desy") && (x509UserProxyVOName isnt "ops") && (x509UserProxyVOName isnt "belle"),(BELLECALIBRATION_RESOURCE == false),(BELLECALIBRATION_RESOURCE is false) || (BELLECALIBRATION_RESOURCE is true))
> 
> 
> 19458678.000:  Job has been matched.
> 
> Last successful match: Fri Feb  9 08:52:52 2024
> 
> 
> 19458678.000:  Run analysis summary ignoring user priority.  Of 359 machines,
>      0 are rejected by your job's requirements
>     17 reject your job because of their own requirements
>      0 match and are already running your jobs
>      0 match but are serving other users
>    342 are able to run your job
> 
> 
> [3.a - CE Entry Point]
> condor-9.0.15-1.el7.x86_64
> condor-boinc-7.16.16-1.el7.x86_64
> condor-classads-9.0.15-1.el7.x86_64
> condor-externals-9.0.15-1.el7.x86_64
> condor-procd-9.0.15-1.el7.x86_64
> htcondor-ce-5.1.5-1.el7.noarch
> htcondor-ce-apel-5.1.5-1.el7.noarch
> htcondor-ce-bdii-5.1.3-1.el7.noarch
> htcondor-ce-client-5.1.5-1.el7.noarch
> htcondor-ce-condor-5.1.5-1.el7.noarch
> htcondor-ce-view-5.1.5-1.el7.noarch
> python2-condor-9.0.15-1.el7.x86_64
> python3-condor-9.0.15-1.el7.x86_64
> 
> [3.b - Execution Point]
> condor-9.0.8-1.el7.x86_64
> condor-boinc-7.16.16-1.el7.x86_64
> condor-classads-9.0.8-1.el7.x86_64
> condor-externals-9.0.8-1.el7.x86_64
> condor-procd-9.0.8-1.el7.x86_64
> htcondor-ce-client-5.1.3-1.el7.noarch
> python2-condor-9.0.8-1.el7.x86_64
> python3-condor-9.0.8-1.el7.x86_64
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> 
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/