
Re: [HTCondor-users] Jobs of single group getting "lost" in negotiation-claim progress



Hi Max.  I presume from this command

 condor_q -global atlasprd ...

that one of the users in question is atlasprd.  Those condor_q totals show 661+114+62+8 jobs running across all of those schedds.
That is consistent with the negotiator seeing 1400 cores claimed by jobs in the Atlas group.  The schedd shows jobs in "running" state
as soon as it has matches and is trying to start the jobs on those matches.  There should be something in the SchedLog at that point, although possibly
not at the default log level.  It might be more productive to track this from the execute side, however.

Try running

   condor_status -claimed

And pick out one of the machines that has claimed slots for that user.  Then go to that machine and look at the StartLog and StarterLog.*   
You might also try  running condor_who on that machine.   I would do that first, to see if you catch any slots showing up as claimed by that user.
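For example, something along these lines (just a sketch; the domain part of RemoteUser depends on your UID_DOMAIN, so atlasprd@example.org here is only a placeholder):

   # list claimed slots together with the user holding each claim
   condor_status -claimed

   # narrow it down to slots claimed by the user in question
   condor_status -claimed -constraint 'RemoteUser =?= "atlasprd@example.org"' \
       -af:h Machine Name State Activity RemoteUser

   # then, on one of those execute nodes, ask the startd/starter what is actually running
   condor_who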

Best guess is that the Schedd is repeatedly trying and failing to start jobs on those machines, which would show up as activity mostly in the StartLog
and possibly the StarterLog.   If you have no messages in the ShadowLog on the schedd side, then logging on the execute side will be in the StartLog.  If the
process of starting a job gets further, then logging moves to the ShadowLog on the AP and StarterLog on the EP. 
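A quick way to check is to grep for the job id in those logs; a sketch, assuming a default /var/log/condor LOG directory and a made-up job id 123456.0 (substitute your own -- the job id may only appear once a claim is activated, so grepping for the slot name or the user can also help):

   # on the execute node: did the startd/starter ever do anything for this job?
   grep '123456' /var/log/condor/StartLog
   grep '123456' /var/log/condor/StarterLog*

   # on the access point: was a shadow ever started for it?
   grep '123456' /var/log/condor/ShadowLog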

In answer to your question: yes, it is possible for specific jobs to have resource requests that match a partitionable slot but not the dynamic slot
which is created to satisfy the resource request.   This problem will show up in the StartLog.   More recent versions of HTCondor log this sort
of failure more clearly; in older versions the logging does exist, it's just not as directly helpful.
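As a concrete (hypothetical) illustration of that mismatch -- not taken from your jobs, just showing the shape of the problem:

   # request only 2 GB, but require a machine with more than 100 GB of Memory:
   # the partitionable slot advertises its total Memory and therefore matches,
   # but the dynamic slot carved out with 2048 MB does not satisfy Requirements,
   # so the claim never actually starts a job
   request_memory = 2048
   requirements   = (Memory > 100000)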

hope this helps.
-tj



From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Fischer, Max (SCC) <max.fischer@xxxxxxx>
Sent: Friday, February 2, 2024 4:52 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Jobs of single group getting "lost" in negotiation-claim progress
 
Hi all,

our cluster is in a very weird state that I have never seen before. I have no idea how to reproduce this, but I am hoping someone has ideas how to fix it.
We're still stuck on the HTCondor 9.X series due to user requirements.

For about three days now, we have been observing a weird state for only a single group (ATLAS) out of roughly a dozen in total, all of which otherwise work fine. What we observe is as follows:

1. The group is under its relative quota and the Negotiator is prioritising matching its jobs.
        - Jobs get matched to StartDs, as shown in the Negotiator log.
2. Once a job has been matched, it changes to `NumJobMatches = 0` in the queue *but does not start running*.
        - We see nothing in the Schedd, Shadow, Startd nor Starter logs for this job at this point.
        - The job is then stuck in this state, neither starting nor timing out a claim nor being re-matched.
        - The job also has an attribute `Matched = true`, which isn't documented anywhere (see the query sketch after this list).
3. The Negotiator slowly reduces the count of 'Claimed Cores' and 'Requested Cores'.
        - Consequently, it stops matching jobs of this group because it doesn't see them anymore.
        - Our best guess is that the Negotiator is simply ignoring jobs in this stuck state.
        - As an example, at the same time the Negotiator sees jobs equalling 1400 slots [0], while the Schedds see about 3000 jobs in total [1], requesting even more slots.
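As an illustration, a query along these lines (a sketch; it relies on the undocumented `Matched` attribute we see on the stuck jobs) lists the jobs in question:

   condor_q -global -allusers \
       -constraint 'JobStatus == 1 && Matched =?= true' \
       -af:j Owner AcctGroup LastMatchTime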

Points 2 and 3 are the problematic ones for us. ^^

What we especially don't get is that things work perfectly fine for all other groups. We have no special provisions (e.g. START, Requirements, GroupSortExpr, etc.) based on specific groups anywhere in the cluster.

Is there anything on *individual* jobs that could lead to such behaviour? Could there be some attributes that can interfere with jobs starting?

Cheers,
Max


[0] NegotiatorLog
02/02/24 11:20:37 (pid:69874) (D_ALWAYS) Group                Computed   Config    Quota      Use     Auto  Claimed Requestd SubmtersAllocatd
02/02/24 11:20:37 (pid:69874) (D_ALWAYS) Name                    quota    quota   static  surplus  Regroup    cores    cores in group   cores
02/02/24 11:20:37 (pid:69874) (D_ALWAYS) ----------------------------------------------------------------------------------------------------
02/02/24 11:20:37 (pid:69874) (D_ALWAYS) <none>               7.27596e-12        0        N        Y        Y        0        0       33       0
02/02/24 11:20:37 (pid:69874) (D_ALWAYS) Alice                   12323 0.237195        N        Y        Y    16312    17905        4       0
02/02/24 11:20:37 (pid:69874) (D_ALWAYS) Atlas                 15403.7 0.296493        N        Y        Y     1400     1400        4       0
02/02/24 11:20:37 (pid:69874) (D_ALWAYS) Auger                 167.185 0.003218        N        Y        Y      127      606        4       0
02/02/24 11:20:37 (pid:69874) (D_ALWAYS) Babar                 168.899 0.003251        N        Y        Y        0        0        0       0
02/02/24 11:20:37 (pid:69874) (D_ALWAYS) Belle                 2524.03 0.048583        N        Y        Y      593      592        4       0
02/02/24 11:20:37 (pid:69874) (D_ALWAYS) CMS                   6893.96 0.132696        N        Y        Y    17688    30488        4       0

[1] # condor_q -global atlasprd -allusers -total
-- Schedd: htcondor-ce-4-kit.gridka.de : <[2a00:139c:3:2e5:0:61:6a:7b]:9618?... @ 02/02/24 11:34:26
Total for query: 1539 jobs; 0 completed, 0 removed, 878 idle, 661 running, 0 held, 0 suspended
Total for all users: 6221 jobs; 3 completed, 0 removed, 1525 idle, 4639 running, 54 held, 0 suspended

-- Schedd: htcondor-ce-3-kit.gridka.de : <[2a00:139c:3:2e5:0:61:6a:7d]:9618?... @ 02/02/24 11:34:26
Total for query: 1112 jobs; 0 completed, 0 removed, 998 idle, 114 running, 0 held, 0 suspended
Total for all users: 6057 jobs; 10 completed, 0 removed, 1639 idle, 4353 running, 55 held, 0 suspended

-- Schedd: pps-htcondor-ce.gridka.de : <[2a00:139c:3:2e5:0:61:d2:6c]:9618?... @ 02/02/24 11:34:26
Total for query: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for all users: 33 jobs; 0 completed, 0 removed, 0 idle, 0 running, 33 held, 0 suspended

-- Schedd: htcondor-ce-1-kit.gridka.de : <[2a00:139c:3:2e5:0:61:1:6a]:9618?... @ 02/02/24 11:34:26
Total for query: 214 jobs; 0 completed, 0 removed, 152 idle, 62 running, 0 held, 0 suspended
Total for all users: 5687 jobs; 4 completed, 0 removed, 750 idle, 4933 running, 0 held, 0 suspended

-- Schedd: pps-token-htcondor-ce.gridka.de : <[2a00:139c:3:2e5:0:61:d2:8e]:9618?... @ 02/02/24 11:34:26
Total for query: 12 jobs; 0 completed, 0 removed, 12 idle, 0 running, 0 held, 0 suspended
Total for all users: 12 jobs; 0 completed, 0 removed, 12 idle, 0 running, 0 held, 0 suspended

-- Schedd: htcondor-ce-2-kit.gridka.de : <[2a00:139c:3:2e5:0:61:1:6c]:9618?... @ 02/02/24 11:34:26
Total for query: 43 jobs; 0 completed, 0 removed, 35 idle, 8 running, 0 held, 0 suspended
Total for all users: 5053 jobs; 12 completed, 0 removed, 546 idle, 4438 running, 57 held, 0 suspended
