
[HTCondor-users] Diagnosing condor_q failures and unresponsiveness



I recently redeployed condor with some changes and I have been running
into the following issues:


1) condor_q hangs and is intermittently failing with
==========================
condor_q -debug
06/02/19 23:04:52 condor_read(): timeout reading 5 bytes from schedd
at <x.x.x.x:49139>.
06/02/19 23:04:52 IO: Failed to read packet header
06/02/19 23:04:52 SECMAN: no classad from server, failing
-- Failed to fetch ads from: <x.x.x.x:49139> : proda-dock01
SECMAN:2007:Failed to end classad message.
============================
2) condor_submit hangs for a long time before submitting a job
3) jobs sit idle for up to 15 minutes without running, even though
plenty of resources are available.
4) condor_q displays really long *AutoCluster attributes* that weren't
there before
5) the negotiator log displays
===================================
06/02/19 23:53:01 condor_read(): timeout reading 21 bytes from schedd
<acctgroup>@proda-dock01.
06/02/19 23:53:01 IO: Failed to read packet header
06/02/19 23:53:01     Failed to get reply from schedd
06/02/19 23:53:01   Error: Ignoring submitter for this cycle
06/02/19 23:53:01  negotiateWithGroup resources used submitterAds length 0
==================================
6) some jobs are put on hold with
Error from slot1_1@chicagoawe162: STARTER at x.x.x.x failed to send
file(s) to <x.x.x.x.:9618>; SHADOW at x.x.x.x failed to write to file
/usr/local/condor/lib/condor/spool/114/0/cluster114.proc0.subproc0/_condor_stdout:
(errno 2) No such file or directory
7) various FD errors in the logs
* CollectorLog: condor_write(): Socket closed when trying to write 62
bytes to SHADOW, fd is 119 or fd is 120
* ShadowLog: condor_write(): Socket closed when trying to write 976
bytes to startd
* ToolLog: condor_write(): Socket closed when trying to write 280 bytes
to <x.x.x.x:60653>, fd is 16

Some of the recent configuration changes were:
===================================
* Enabling the EVENT_LOG
* Setting a Concurrency_limit_default
===================================
I also updated the submit nodes with a
* StartD Cron job (to run a health check of our services before
submitting a job)
===================================
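For concreteness, configuration changes of this kind typically take
roughly the following shape. This is only a sketch, not the actual
configuration in use here: the event log path, the limit value, the
cron job name HEALTHCHECK, its script path, and its period are all
placeholders chosen for illustration.

```
# Sketch only; every value below is a placeholder.
# Write a pool-wide event log in addition to per-job logs.
EVENT_LOG = $(LOG)/EventLog

# Default cap applied to any concurrency limit without its own setting.
CONCURRENCY_LIMIT_DEFAULT = 100

# Register a periodic StartD Cron job (hypothetical name HEALTHCHECK).
STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) HEALTHCHECK
STARTD_CRON_HEALTHCHECK_EXECUTABLE = /path/to/healthcheck.sh
STARTD_CRON_HEALTHCHECK_PERIOD = 5m
```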
I also updated the *submit files* for my portal with the following:
* ConcurrencyLimits = <AccountingGroup>
* changed accounting_group to be +AccountingGroup
* dynamic memory request (request_memory = ifthenelse(MemoryUsage =!=
undefined, MAX({MemoryUsage * 3/2, <min_memory>}))
* MaxJobRetirementTime = 604800
* Periodic_Remove = ( RemoteWallClockTime > 604800 )
===================================
In addition, the
* log file is now specified in the submit file
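
Put together, the submit-file changes listed above would look roughly
like the fragment below. This is a sketch under stated assumptions, not
the actual submit file: the group and user names, the 2048 MB minimum,
and the log path are placeholders, the third argument of ifthenelse
(falling back to the minimum when MemoryUsage is undefined) is an
assumption, and the retirement time is written with the
max_job_retirement_time submit command, which may differ from how it
is set in the real file.

```
# Sketch only; names, paths and the 2048 MB minimum are placeholders.
concurrency_limits = group_example
+AccountingGroup = "group_example.someuser"

# Request 1.5x the last observed usage, but never less than the minimum.
# Assumption: use the minimum when MemoryUsage is not yet defined.
request_memory = ifthenelse(MemoryUsage =!= undefined, MAX({MemoryUsage * 3/2, 2048}), 2048)

# 604800 seconds = 7 days.
max_job_retirement_time = 604800
periodic_remove = (RemoteWallClockTime > 604800)

# Per-job user log, now set in the submit file.
log = /path/to/job.log
```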

Any advice on how to diagnose and fix these issues would be appreciated.

Thanks,
Boris