
Re: [HTCondor-users] Diagnosing condor_q failures and unresponsiveness



Hi Boris,

It sounds like the condor_schedd is too busy to respond to queries or other daemons. I would check the SchedLog and see what it's doing.
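
If it's not obvious where that log lives, something along these lines will find and follow it (a minimal sketch; paths depend on your configuration):

    # Ask the local config where the schedd writes its log
    condor_config_val SCHEDD_LOG

    # Follow it while reproducing the condor_q hang
    tail -f $(condor_config_val SCHEDD_LOG)

Temporarily setting SCHEDD_DEBUG = D_FULLDEBUG (followed by a condor_reconfig) will give you much more detail while you reproduce the problem.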

I'd also recommend rounding your request_memory values to improve autoclustering (e.g. up to the next GB). Be aware that changing a job's request_memory while it is running won't change the amount it has already reserved. This could lead to machines running out of memory if your <min_memory> is lower than the actual usage of most jobs.
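
For example, something like this in the submit file rounds the request up to the next GiB (a sketch, not your exact expression; the 2048 MB fallback here just stands in for your <min_memory>):

    # MemoryUsage and request_memory are in MB; quantize() rounds its
    # first argument up to the next multiple of its second argument.
    request_memory = ifThenElse(MemoryUsage =!= undefined, quantize(MemoryUsage * 3 / 2, 1024), 2048)

The SCHEDD_ROUND_ATTR_RequestMemory config knob can do similar rounding on the schedd side without touching submit files.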

Best,
Collin

On Sun, Jun 2, 2019 at 10:09 PM Boris Sadkhin <bsadkhin.anl@xxxxxxxxx> wrote:
I recently redeployed condor with some changes and I have been running
into the following issues:


1) condor_q hangs and intermittently fails with
==========================
condor_q -debug
06/02/19 23:04:52 condor_read(): timeout reading 5 bytes from schedd
at <x.x.x.x:49139>.
06/02/19 23:04:52 IO: Failed to read packet header
06/02/19 23:04:52 SECMAN: no classad from server, failing
-- Failed to fetch ads from: <x.x.x.x:49139> : proda-dock01
SECMAN:2007:Failed to end classad message.
============================
2) condor_submit hangs for a long time before submitting a job
3) jobs sit idle for up to 15 minutes without running, even though
there are plenty of free resources
4) condor_q displays really long *AutoCluster attributes* that weren't
there before
5) the negotiator log displays
===================================
06/02/19 23:53:01 condor_read(): timeout reading 21 bytes from schedd
<acctgroup>@proda-dock01.
06/02/19 23:53:01 IO: Failed to read packet header
06/02/19 23:53:01     Failed to get reply from schedd
06/02/19 23:53:01    Error: Ignoring submitter for this cycle
06/02/19 23:53:01   negotiateWithGroup resources used submitterAds length 0
==================================
6) some jobs are put on hold with
Error from slot1_1@chicagoawe162: STARTER at x.x.x.x failed to send
file(s) to <x.x.x.x.:9618>; SHADOW at x.x.x.x failed to write to file
/usr/local/condor/lib/condor/spool/114/0/cluster114.proc0.subproc0/_condor_stdout:
(errno 2) No such file or directory
7) various FD errors in the logs
* CollectorLog: condor_write(): Socket closed when trying to write 62
bytes to SHADOW, fd is 119 or fd is 120
* ShadowLog: condor_write(): Socket closed when trying to write 976
bytes to startd
* ToolLog: condor_write(): Socket closed when trying to write 280 bytes
to <x.x.x.x:60653>, fd is 16

Some of the recent configuration changes made were:
===================================
* Enabling the EVENT_LOG
* Setting a CONCURRENCY_LIMIT_DEFAULT
===================================
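
Concretely, those two changes look roughly like this in the config (the limit value here is a placeholder):

    EVENT_LOG = $(LOG)/EventLog
    CONCURRENCY_LIMIT_DEFAULT = 100
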
As well as updating submit nodes with a
* StartD Cron job (in order to health-check our services before
submitting a job; wired up roughly as sketched below)
===================================
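
The cron hook is configured along these lines (the job name, period, and script path are placeholders, not my exact setup):

    STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) HEALTHCHECK
    STARTD_CRON_HEALTHCHECK_EXECUTABLE = /usr/local/bin/health_check.sh
    STARTD_CRON_HEALTHCHECK_PERIOD = 5m
    STARTD_CRON_HEALTHCHECK_MODE = Periodic
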
And updated *submit files* for my portal to
* ConcurrencyLimits = <AccountingGroup>
* changed accounting_group to be +AccountingGroup
* dynamic memory request (request_memory = ifthenelse(MemoryUsage =!=
undefined, MAX({MemoryUsage * 3/2, <min_memory>}), <min_memory>))
* MaxJobRetirementTime = 604800
* "Periodic_Remove = ( RemoteWallClockTime > 604800 )");
===================================
In addition, the
* log file is now specified in the submit file

Any advice on how to diagnose and fix these issues would be appreciated.

Thanks,
Boris


--
Collin Mehring | PE-JoSE - Software Engineer