
Re: [HTCondor-users] increasing schedd memory usage [v8.6.0?]



Hi all,

We have probably found the cause and fixed it (fingers crossed).

post mortem ~>
https://confluence.desy.de/pages/viewpage.action?pageId=47425023

Presumably during a 'transparent' maintenance of the ARC's underlying
supervisor, the Condor shadows etc. could not access the local job files.
This apparently caused a large number of jobs to be seen as failed by
Condor and put on hold.
Condor then seems to have been overwhelmed by the sheer number of held
jobs (~160,000 jobs on hold, with
/var/lib/condor/spool.old.20170214/job_queue.log already at ~1.4 GB).
Simply removing the held jobs with condor_rm failed as well, so we moved
the spool directory aside and gave Condor a fresh start.
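
For the record, the recovery boiled down to roughly the following (just a
sketch; the JobStatus == 5 constraint is the generic way to address held
jobs, and the service name/spool ownership may differ on other setups):

  # Try to drop the held jobs first (JobStatus == 5 means "Held");
  # with ~160,000 jobs in the queue this is the step that failed for us.
  condor_rm -constraint 'JobStatus == 5'

  # Fallback: stop Condor, move the bloated spool aside, start fresh.
  systemctl stop condor        # or "service condor stop" on older systems
  mv /var/lib/condor/spool /var/lib/condor/spool.old.20170214
  mkdir /var/lib/condor/spool
  chown condor:condor /var/lib/condor/spool
  systemctl start condor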

Since then, the node has been running fine again.

Cheers,
  Thomas
