
[HTCondor-users] Jobs of single group getting "lost" in negotiation-claim progress



Hi all,

Our cluster is in a very weird state that I have never seen before. I have no idea how to reproduce this, but I am hoping someone has ideas on how to fix it.
We're still stuck on the HTCondor 9.X series due to user requirements.

For about three days, we have observed a weird state for only a single group (ATLAS) out of roughly a dozen in total; all the others work fine. What we observe is as follows:

1. The group is under its relative quota and the Negotiator is prioritising matching its jobs.
	- Jobs get matched to StartDs, as shown in the Negotiator log.
2. Once a job has been matched, it changes to `NumJobMatches = 0` in the queue *but does not start running*.
	- We see nothing in the Schedd, Shadow, Startd, or Starter logs for this job at this point.
	- The job is then stuck in this state, neither starting nor timing out a claim nor being re-matched.
	- The job also has an attribute `Matched = true`, which isn't documented anywhere.
3. The Negotiator slowly reduces the count of 'Claimed Cores' and 'Requested Cores'.
	- Consequently, it stops matching jobs of this group because it doesn't see them anymore.
	- As far as we can tell, this is consistent with the Negotiator simply ignoring jobs in this stuck state.
	- As an example, at the same time the Negotiator sees jobs equaling 1400 slots [0], but the Schedds see about 3000 jobs total [1], requesting even more slots.

The points 2. and 3. are kinda problematic for us. ^^

What we especially don't get is that things are working perfectly fine for other groups. We have no special provisions (e.g. START, Requirements, GroupSortExpr, etc.) based on specific groups anywhere in the cluster.

Is there anything on *individual* jobs that could lead to such behaviour? Could there be some attributes that can interfere with jobs starting?

Cheers,
Max


[0] NegotiatorLog
02/02/24 11:20:37 (pid:69874) (D_ALWAYS) Group                Computed   Config    Quota      Use     Auto  Claimed Requestd SubmtersAllocatd
02/02/24 11:20:37 (pid:69874) (D_ALWAYS) Name                    quota    quota   static  surplus  Regroup    cores    cores in group   cores
02/02/24 11:20:37 (pid:69874) (D_ALWAYS) ----------------------------------------------------------------------------------------------------
02/02/24 11:20:37 (pid:69874) (D_ALWAYS) <none>               7.27596e-12        0        N        Y        Y        0        0       33       0
02/02/24 11:20:37 (pid:69874) (D_ALWAYS) Alice                   12323 0.237195        N        Y        Y    16312    17905        4       0
02/02/24 11:20:37 (pid:69874) (D_ALWAYS) Atlas                 15403.7 0.296493        N        Y        Y     1400     1400        4       0
02/02/24 11:20:37 (pid:69874) (D_ALWAYS) Auger                 167.185 0.003218        N        Y        Y      127      606        4       0
02/02/24 11:20:37 (pid:69874) (D_ALWAYS) Babar                 168.899 0.003251        N        Y        Y        0        0        0       0
02/02/24 11:20:37 (pid:69874) (D_ALWAYS) Belle                 2524.03 0.048583        N        Y        Y      593      592        4       0
02/02/24 11:20:37 (pid:69874) (D_ALWAYS) CMS                   6893.96 0.132696        N        Y        Y    17688    30488        4       0

[1] # condor_q -global atlasprd -allusers -total
-- Schedd: htcondor-ce-4-kit.gridka.de : <[2a00:139c:3:2e5:0:61:6a:7b]:9618?... @ 02/02/24 11:34:26
Total for query: 1539 jobs; 0 completed, 0 removed, 878 idle, 661 running, 0 held, 0 suspended
Total for all users: 6221 jobs; 3 completed, 0 removed, 1525 idle, 4639 running, 54 held, 0 suspended

-- Schedd: htcondor-ce-3-kit.gridka.de : <[2a00:139c:3:2e5:0:61:6a:7d]:9618?... @ 02/02/24 11:34:26
Total for query: 1112 jobs; 0 completed, 0 removed, 998 idle, 114 running, 0 held, 0 suspended
Total for all users: 6057 jobs; 10 completed, 0 removed, 1639 idle, 4353 running, 55 held, 0 suspended

-- Schedd: pps-htcondor-ce.gridka.de : <[2a00:139c:3:2e5:0:61:d2:6c]:9618?... @ 02/02/24 11:34:26
Total for query: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for all users: 33 jobs; 0 completed, 0 removed, 0 idle, 0 running, 33 held, 0 suspended

-- Schedd: htcondor-ce-1-kit.gridka.de : <[2a00:139c:3:2e5:0:61:1:6a]:9618?... @ 02/02/24 11:34:26
Total for query: 214 jobs; 0 completed, 0 removed, 152 idle, 62 running, 0 held, 0 suspended
Total for all users: 5687 jobs; 4 completed, 0 removed, 750 idle, 4933 running, 0 held, 0 suspended

-- Schedd: pps-token-htcondor-ce.gridka.de : <[2a00:139c:3:2e5:0:61:d2:8e]:9618?... @ 02/02/24 11:34:26
Total for query: 12 jobs; 0 completed, 0 removed, 12 idle, 0 running, 0 held, 0 suspended
Total for all users: 12 jobs; 0 completed, 0 removed, 12 idle, 0 running, 0 held, 0 suspended

-- Schedd: htcondor-ce-2-kit.gridka.de : <[2a00:139c:3:2e5:0:61:1:6c]:9618?... @ 02/02/24 11:34:26
Total for query: 43 jobs; 0 completed, 0 removed, 35 idle, 8 running, 0 held, 0 suspended
Total for all users: 5053 jobs; 12 completed, 0 removed, 546 idle, 4438 running, 57 held, 0 suspended
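
For reference, the "about 3000 jobs" figure can be checked by adding up the per-Schedd "Total for query" lines from [1]. The following is only a minimal sketch: the summary lines are pasted in by hand, and the names `CONDOR_Q_TOTALS`, `TOTAL_LINE` and `sum_totals` are made up for illustration, not part of any HTCondor tooling.

```python
import re

# "Total for query" lines copied by hand from the condor_q output in [1],
# one per Schedd, in the order they appear there.
CONDOR_Q_TOTALS = """\
Total for query: 1539 jobs; 0 completed, 0 removed, 878 idle, 661 running, 0 held, 0 suspended
Total for query: 1112 jobs; 0 completed, 0 removed, 998 idle, 114 running, 0 held, 0 suspended
Total for query: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for query: 214 jobs; 0 completed, 0 removed, 152 idle, 62 running, 0 held, 0 suspended
Total for query: 12 jobs; 0 completed, 0 removed, 12 idle, 0 running, 0 held, 0 suspended
Total for query: 43 jobs; 0 completed, 0 removed, 35 idle, 8 running, 0 held, 0 suspended
"""

# Captures the total, idle and running job counts from one summary line.
TOTAL_LINE = re.compile(
    r"Total for query: (\d+) jobs; .*?(\d+) idle, (\d+) running"
)


def sum_totals(text):
    """Add up job counts over all 'Total for query' lines in text."""
    totals = {"jobs": 0, "idle": 0, "running": 0}
    for line in text.splitlines():
        match = TOTAL_LINE.match(line)
        if match is None:
            continue  # skip anything that is not a summary line
        jobs, idle, running = (int(value) for value in match.groups())
        totals["jobs"] += jobs
        totals["idle"] += idle
        totals["running"] += running
    return totals


if __name__ == "__main__":
    print(sum_totals(CONDOR_Q_TOTALS))
```

This gives 2920 jobs in total (2075 idle, 845 running) across the six Schedds, against the 1400 requested cores the Negotiator reports for Atlas in [0]. Note that this counts jobs, not cores, so it is a lower bound on the slots actually requested.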