
[HTCondor-users] CE idle job pile up



Hi all,

on our two CondorCEs we observed pile-ups of idle jobs: yesterday on
the first one and today on the other.
For a specific VO (here ATLAS), hardly any jobs went from idle on the CE
to running on the batch Condor [1]. The `-better-analyze` outputs were
inconclusive, and the user priorities were in favour of the group.
Judging from the CE logs, the CE does not seem to have tried to submit
its jobs at all; at least the job IDs did not appear in the SchedLogs
and routes. However, all daemons were responsive and all CLI commands
responded in time.

Restarting the condor-ce and condor units 'fixed' the issue in both
cases: shortly after each restart, most of the previously idle jobs
were routed and started to run [2].

Unfortunately, I have not found a clue in the logs that could be used
as a trigger for an alarm (to notice the issue a bit earlier and force a
restart).

Has anybody noticed something similar and can suggest what might be a
good trigger to watch for in such a case?

Cheers,
  Thomas

[1] before the service restarts
Total for query: 1747 jobs; 1 completed, 0 removed, 1577 idle, 169
running, 0 held, 0 suspended

[2] after the service restarts
Total for query: 1730 jobs; 25 completed, 0 removed, 167 idle, 1538
running, 0 held, 0 suspended
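
In the absence of a log-based trigger, one crude option might be to
watch the queue totals themselves. The following is only a minimal
sketch, assuming the summary line keeps the format shown in [1] and [2];
the function names and both thresholds (max_idle_fraction, min_idle) are
made up for illustration and would need tuning per site. A high idle
fraction can of course also just mean that the pool is full, so this
could at best raise a warning, not prove that the CE stopped routing.

```python
import re

# Matches the summary line printed at the end of a queue listing,
# in the format shown in [1] and [2].
SUMMARY_RE = re.compile(
    r"Total for query:\s*(\d+) jobs;\s*(\d+) completed,\s*(\d+) removed,\s*"
    r"(\d+) idle,\s*(\d+) running,\s*(\d+) held,\s*(\d+) suspended"
)

KEYS = ("total", "completed", "removed", "idle", "running", "held", "suspended")


def parse_summary(text):
    """Return the job counts from a summary line as a dict of ints."""
    # Collapse whitespace so a summary wrapped over two lines still matches.
    flat = " ".join(text.split())
    match = SUMMARY_RE.search(flat)
    if match is None:
        raise ValueError("no 'Total for query' summary found")
    return dict(zip(KEYS, map(int, match.groups())))


def looks_stalled(counts, max_idle_fraction=0.8, min_idle=100):
    """Heuristic: many idle jobs and few running ones.

    Both thresholds are hypothetical defaults, not recommended values.
    """
    active = counts["idle"] + counts["running"]
    if active == 0:
        return False
    idle_fraction = counts["idle"] / active
    return counts["idle"] >= min_idle and idle_fraction > max_idle_fraction


before = """Total for query: 1747 jobs; 1 completed, 0 removed, 1577 idle, 169
running, 0 held, 0 suspended"""
after = """Total for query: 1730 jobs; 25 completed, 0 removed, 167 idle, 1538
running, 0 held, 0 suspended"""

print(looks_stalled(parse_summary(before)))  # True  (1577 of 1746 active jobs idle)
print(looks_stalled(parse_summary(after)))   # False (167 of 1705 active jobs idle)
```

Fed with the two summaries above, the check flags the state before the
restarts and not the state after them.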


[3]
htcondor-ce-view-5.1.3-1.el7.noarch
python2-condor-9.0.11-1.el7.x86_64
htcondor-ce-5.1.3-1.el7.noarch
htcondor-ce-bdii-5.1.3-1.el7.noarch
condor-classads-9.0.11-1.el7.x86_64
condor-9.0.11-1.el7.x86_64
htcondor-ce-condor-5.1.3-1.el7.noarch
condor-procd-9.0.11-1.el7.x86_64
condor-externals-9.0.11-1.el7.x86_64
htcondor-ce-client-5.1.3-1.el7.noarch
htcondor-ce-apel-5.1.3-1.el7.noarch
python3-condor-9.0.11-1.el7.x86_64

on CentOS Linux release 7.9.2009 3.10.0-1160.62.1.el7.x86_64
