[HTCondor-users] AutoClusterAttrs - How does Condor decide what to use?

Hi All

Twice in the past 6 months we've had an issue that was ultimately traced back
to the AutoClusterAttrs job ClassAd attribute on a particular submit machine.
Or, to be more precise, to a particular user's jobs on that submit node.

The original symptoms were 100% CPU use on the central manager, causing all
users' jobs from all submit nodes to struggle to get resources allocated to them.
The condor_negotiator appeared to be the culprit: each negotiation cycle was
taking a very long time (5, 10, even 15 minutes). Further investigation
eventually traced it back to jobs from one user on one submit node. The negotiator
logs showed that it was negotiating every single job (of many thousands) from that
user/submit node individually. This then led us to check the AutoClusterAttrs attribute.

Most other jobs from other users and submit nodes had something like:

AutoClusterAttrs = "JobUniverse,LastCheckpointPlatform,NumCkpts,SHORTJOB,User,Di

However the problem jobs had:

AutoClusterAttrs = "JobUniverse,GlobalJobId,LastCheckpointPlatform,NumCkpts,SHORTJOB,User,Di

i.e. it included GlobalJobId, which is unique for each submitted job, so every
job ended up in its own autocluster and was negotiated separately. Hence the issue.
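
To illustrate why one unique attribute is enough to cause this, here is a rough
Python sketch of grouping jobs by the values of a list of attributes. It is only
an illustration of the idea, not HTCondor's actual implementation; the function
name, the job dictionaries and the attribute values are all made up.

```python
from collections import defaultdict


def group_into_autoclusters(jobs, significant_attrs):
    """Group job dicts by their values for the given attributes.

    Jobs that agree on every listed attribute land in the same group.
    This is a toy model of autoclustering, not HTCondor's real code.
    """
    clusters = defaultdict(list)
    for job in jobs:
        key = tuple(job.get(attr) for attr in significant_attrs)
        clusters[key].append(job)
    return clusters


# Hypothetical jobs: identical apart from GlobalJobId.
jobs = [
    {
        "JobUniverse": 5,
        "User": "alice@example.org",
        "GlobalJobId": f"submit.example.org#100.{i}",
    }
    for i in range(1000)
]

without_id = group_into_autoclusters(jobs, ["JobUniverse", "User"])
with_id = group_into_autoclusters(jobs, ["JobUniverse", "GlobalJobId", "User"])

print(len(without_id), len(with_id))  # prints: 1 1000
```

With GlobalJobId in the list, the 1000 otherwise identical jobs become 1000
separate groups, which matches what we saw in the negotiator logs.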

So, a couple of things:

1. How does condor decide what attributes to use for AutoClusterAttrs?

2. We worked around this by adding AutoClusterAttrs to the .local config file
on the submit node. It has now occurred again with a different user/submit node,
so we are looking at making global config changes for all submit machines.

3. If anyone else has this issue, or wants to check their systems, then
the following will print a count of jobs for each AutoClusterID.

condor_q -name linear-yf.nexus.csiro.au -pool condor-act.csiro.au -f "%s" AutoClusterID -f " %s" ClusterID -f ".%s\n" ProcID | sort | awk ' BEGIN { print "AutoClusterID   Num of jobs"} {FS=" "}  { cats[$1] = cats[$1] + 1} END { for(c in cats) { print c, "=", cats[c]} } '

This is for Linux only; my head hurts thinking about how to do this in DOS! :)
Fortunately, although our fleet is Windows, our CMs are all Linux.
If every AutoClusterID has a count of 1, then that's the problem.
Normally you will see something like:

AutoClusterID   Num of jobs
4 = 72
5 = 32
6 = 3
7 = 2
8 = 1
9 = 1
10 = 2
1 = 2344
2 = 65
3 = 69

i.e. the negotiator has only 10 distinct autoclusters to negotiate for, rather
than the 2591 it would have if each job had a unique AutoClusterID.
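
For anyone who would rather not use awk, the counting step can be sketched in
Python. This assumes input lines of the form "AutoClusterID ClusterID.ProcID",
as printed by the condor_q command above; the function names and the sample
lines are hypothetical.

```python
from collections import Counter


def count_autoclusters(lines):
    """Count jobs per AutoClusterID.

    Each line is expected to look like "<AutoClusterID> <ClusterID>.<ProcID>";
    only the first whitespace-separated field is used. Blank lines are skipped.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:
            counts[fields[0]] += 1
    return counts


def all_singletons(counts):
    """Return True if there are several jobs and each has its own AutoClusterID."""
    return len(counts) > 1 and all(n == 1 for n in counts.values())


# Made-up sample in the same shape as the condor_q output.
sample = ["1 2001.0", "1 2001.1", "2 2002.0", "1 2001.2"]
counts = count_autoclusters(sample)

print("AutoClusterID   Num of jobs")
for autocluster_id, n in sorted(counts.items()):
    print(autocluster_id, "=", n)

if all_singletons(counts):
    print("Every job has its own AutoClusterID; check AutoClusterAttrs.")
```

To run it against a real queue, pass the output lines of the condor_q command
(for example sys.stdin) to count_autoclusters() in place of the sample list.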