
Re: [HTCondor-users] Schedd becomes unresponsive for unknown reason.



Dear Colleagues,

It is indeed a significant issue for us at the moment; stability is always a very subtle matter. It would be nice if an HTCondor expert, perhaps one of the developers, could take a look at this, please.

All the best
Alexander A. Prokhorov

On 30 Oct 2020, at 18:43, Sergey A. Komissarov via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:

Hello,

We experienced a strange problem. One execute machine in our cluster was powered off because of memory consumption, and after some time (approx. 6-8 hours) we could no longer submit jobs to HTCondor.

It is even impossible to run condor_q on the schedd; only 1 attempt in 10 gives a correct result:

root@pseven-htcondorsubmit-deploy-76897c9667-jvwpw:/# condor_q -debug
10/30/20 14:02:44 condor_read(): timeout reading 5 bytes from schedd at <10.244.0.181:38415>.
10/30/20 14:02:44 IO: Failed to read packet header
10/30/20 14:02:44 SECMAN: no classad from server, failing

-- Failed to fetch ads from: <10.244.0.181:38415?alias=submit.pseven-htcondor> : submit.pseven-htcondor
SECMAN:2007:Failed to end classad message.
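
To put a number on the "1 attempt in 10" observation, a small probe along the following lines could be run on the submit node. This is only a sketch: the helper name `success_rate`, the attempt count, and the 30-second timeout are illustrative assumptions, and it assumes condor_q exits with a non-zero status when it fails to fetch ads.

```python
import subprocess


def success_rate(cmd, attempts=10, timeout=30):
    """Run cmd repeatedly and return the fraction of runs that exit with status 0."""
    ok = 0
    for _ in range(attempts):
        try:
            result = subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout,
            )
            if result.returncode == 0:
                ok += 1
        except (subprocess.TimeoutExpired, OSError):
            # A hang or a missing binary counts as a failed attempt.
            pass
    return ok / attempts


if __name__ == "__main__":
    # Hypothetical usage: probe the local schedd ten times.
    print(success_rate(["condor_q"]))
```

Running this before and after a schedd restart would show whether the success rate really degrades over time.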


In the logs of the schedd I see the following messages:

[pod/pseven-htcondorsubmit-deploy-76897c9667-jvwpw/htcondorsubmit] 2020-10-30T11:22:13.765961773Z condor_schedd[830]: Resource slot6@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx-htcondor has been unused for 75279 seconds, limit is 600, releasing
[pod/pseven-htcondorsubmit-deploy-76897c9667-jvwpw/htcondorsubmit] 2020-10-30T11:22:23.777247911Z condor_schedd[830]: attempt to connect to <10.244.1.86:19618> failed: timed out after 10 seconds.
[pod/pseven-htcondorsubmit-deploy-76897c9667-jvwpw/htcondorsubmit] 2020-10-30T11:22:23.777295573Z condor_schedd[830]: ERROR in releaseClaim(): canot connect to startd <10.244.1.86:19618?addrs=10.244.1.86-19618&alias=pseven-htcondorexecute-deploy-cd78cd7b-8pr96.pseven-htcondor&noUDP&sock=startd_790_c450_3>
[pod/pseven-htcondorsubmit-deploy-76897c9667-jvwpw/htcondorsubmit] 2020-10-30T11:22:23.777299817Z condor_schedd[830]: Resource slot2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx-htcondor has been unused for 75289 seconds, limit is 600, releasing
[pod/pseven-htcondorsubmit-deploy-76897c9667-jvwpw/htcondorsubmit] 2020-10-30T11:22:33.788898299Z condor_schedd[830]: attempt to connect to <10.244.1.86:19618> failed: timed out after 10 seconds.
[pod/pseven-htcondorsubmit-deploy-76897c9667-jvwpw/htcondorsubmit] 2020-10-30T11:22:33.788940145Z condor_schedd[830]: ERROR in releaseClaim(): canot connect to startd <10.244.1.86:19618?addrs=10.244.1.86-19618&alias=pseven-htcondorexecute-deploy-cd78cd7b-8pr96.pseven-htcondor&noUDP&sock=startd_790_c450_3>
[pod/pseven-htcondorsubmit-deploy-76897c9667-jvwpw/htcondorsubmit] 2020-10-30T11:22:33.788947231Z condor_schedd[830]: Resource slot4@xxxxxxxxxxxxxxxxxxxxx has been unused for 65395 seconds, limit is 600, releasing
[pod/pseven-htcondorsubmit-deploy-76897c9667-jvwpw/htcondorsubmit] 2020-10-30T11:22:33.791301096Z condor_schedd[830]: CCBClient: received failure message from CCB server collector 10.244.0.34:19618?addrs=10.244.0.34-19618&alias=pseven-htcondormanager-deploy-556d67f945-kqcp9.pseven-htcondor&noUDP&sock=collector in response to request for reversed connection to <127.0.0.1:9618>: CCB server rejecting request for ccbid 312 because no daemon is currently registered with that id (perhaps it recently disconnected).
[pod/pseven-htcondorsubmit-deploy-76897c9667-jvwpw/htcondorsubmit] 2020-10-30T11:22:33.791326066Z condor_schedd[830]: Failed to reverse connect to <127.0.0.1:9618> via CCB.
[pod/pseven-htcondorsubmit-deploy-76897c9667-jvwpw/htcondorsubmit] 2020-10-30T11:22:33.791345397Z condor_schedd[830]: ERROR in releaseClaim(): canot connect to startd <127.0.0.1:9618?CCBID=10.244.0.34:19618%3faddrs%3d10.244.0.34-19618%26alias%3dpseven-htcondormanager-deploy-556d67f945-kqcp9.pseven-htcondor%26noUDP%26sock%3dcollector#312&addrs=127.0.0.1-9618&alias=T3500A.tp.tpnet.intra&noUDP&sock=startd_3528_1ca4_3>
[pod/pseven-htcondorsubmit-deploy-76897c9667-jvwpw/htcondorsubmit] 2020-10-30T11:22:33.791351697Z condor_schedd[830]: Resource slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx-htcondor has been unused for 75299 seconds, limit is 600, releasing
[pod/pseven-htcondorsubmit-deploy-76897c9667-jvwpw/htcondorsubmit] 2020-10-30T11:22:43.803363553Z condor_schedd[830]: attempt to connect to <10.244.1.86:19618> failed: timed out after 10 seconds.
[pod/pseven-htcondorsubmit-deploy-76897c9667-jvwpw/htcondorsubmit] 2020-10-30T11:22:43.803407974Z condor_schedd[830]: ERROR in releaseClaim(): canot connect to startd <10.244.1.86:19618?addrs=10.244.1.86-19618&alias=pseven-htcondorexecute-deploy-cd78cd7b-8pr96.pseven-htcondor&noUDP&sock=startd_790_c450_3>
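
In the excerpt above, the connect attempts to the powered-off execute node time out one after another, 10 seconds each. A rough way to see how much wall-clock time these attempts account for is to sum the reported timeouts per target address. This is a minimal sketch: the function name `blocked_seconds` is hypothetical, the regular expression is written against the log lines quoted above, and whether the schedd is actually blocked during these attempts is still only a guess.

```python
import re
from collections import Counter

# Matches schedd lines such as:
#   attempt to connect to <10.244.1.86:19618> failed: timed out after 10 seconds.
TIMEOUT_RE = re.compile(
    r"attempt to connect to <(?P<addr>[^>]+)> failed: "
    r"timed out after (?P<secs>\d+) seconds"
)


def blocked_seconds(lines):
    """Sum the connect-timeout seconds reported per target address."""
    totals = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            totals[match.group("addr")] += int(match.group("secs"))
    return totals
```

Applied to the schedd lines quoted above, this counts three timeouts for 10.244.1.86:19618, i.e. 30 seconds within a 30-second window of the log.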

Logs from the condor master at the same time:

[pod/pseven-htcondormanager-deploy-556d67f945-kqcp9/htcondormanager] 2020-10-30T11:22:19.587846196Z condor_collector[51]: Got QUERY_STARTD_ADS
[pod/pseven-htcondormanager-deploy-556d67f945-kqcp9/htcondormanager] 2020-10-30T11:22:19.589176732Z condor_collector[51]: QueryWorker: forked new worker with id 17878 ( max 4 active 1 pending 0 )
[pod/pseven-htcondormanager-deploy-556d67f945-kqcp9/htcondormanager] 2020-10-30T11:22:19.592515966Z condor_collector[17878]: (Sending 672 ads in response to query)
[pod/pseven-htcondormanager-deploy-556d67f945-kqcp9/htcondormanager] 2020-10-30T11:22:19.676110678Z condor_collector[17878]: Query info: matched=672; skipped=0; query_time=0.003894; send_time=0.083389; type=Machine; requirements={((DedicatedScheduler == "DedicatedScheduler@parallel_schedd@xxxxxxxxxxxxx-htcondor"))}; locate=0; limit=0; from=SCHEDD; peer=<10.244.0.130:46857>; projection={}; filter_private_ads=0
[pod/pseven-htcondormanager-deploy-556d67f945-kqcp9/htcondormanager] 2020-10-30T11:22:26.088118903Z condor_negotiator[52]: condor_read(): timeout reading 5 bytes from schedd DedicatedScheduler@parallel_schedd@xxxxxxxxxxxxx-htcondor.
[pod/pseven-htcondormanager-deploy-556d67f945-kqcp9/htcondormanager] 2020-10-30T11:22:26.088150814Z condor_negotiator[52]: IO: Failed to read packet header
[pod/pseven-htcondormanager-deploy-556d67f945-kqcp9/htcondormanager] 2020-10-30T11:22:26.088154410Z condor_negotiator[52]: AUTHENTICATE: handshake failed!
[pod/pseven-htcondormanager-deploy-556d67f945-kqcp9/htcondormanager] 2020-10-30T11:22:26.088157421Z condor_negotiator[52]: SECMAN: required authentication with schedd DedicatedScheduler@parallel_schedd@xxxxxxxxxxxxx-htcondor failed, so aborting command NEGOTIATE.
[pod/pseven-htcondormanager-deploy-556d67f945-kqcp9/htcondormanager] 2020-10-30T11:22:26.088161806Z condor_negotiator[52]: ERROR: AUTHENTICATE:1002:Failure performing handshake
[pod/pseven-htcondormanager-deploy-556d67f945-kqcp9/htcondormanager] 2020-10-30T11:22:26.088166464Z condor_negotiator[52]:     Failed to send NEGOTIATE command to DedicatedScheduler@parallel_schedd@xxxxxxxxxxxxx-htcondor (<10.244.0.191:19618?addrs=10.244.0.191-19618&alias=submit.pseven-htcondor&noUDP&sock=schedd_788_651f_11>)
[pod/pseven-htcondormanager-deploy-556d67f945-kqcp9/htcondormanager] 2020-10-30T11:22:26.088381500Z condor_negotiator[52]: Failed to prefetch resource request lists for DedicatedScheduler@parallel_schedd@xxxxxxxxxxxxx-htcondor(<10.244.0.191:19618?addrs=10.244.0.191-19618&alias=submit.pseven-htcondor&noUDP&sock=schedd_788_651f_11>).
[pod/pseven-htcondormanager-deploy-556d67f945-kqcp9/htcondormanager] 2020-10-30T11:22:26.088403355Z condor_negotiator[52]: Prefetch cycle hit deadline of 60; skipping remaining submitters.
[pod/pseven-htcondormanager-deploy-556d67f945-kqcp9/htcondormanager] 2020-10-30T11:22:26.088406858Z condor_negotiator[52]: Prefetch summary: 3 attempted, 0 successful.
[pod/pseven-htcondormanager-deploy-556d67f945-kqcp9/htcondormanager] 2020-10-30T11:22:26.088409647Z condor_negotiator[52]: Phase 4.1:  Negotiating with schedds ...
[pod/pseven-htcondormanager-deploy-556d67f945-kqcp9/htcondormanager] 2020-10-30T11:22:26.088427036Z condor_negotiator[52]:   Negotiating with DedicatedScheduler@parallel_schedd@xxxxxxxxxxxxx-htcondor at <10.244.0.191:19618?addrs=10.244.0.191-19618&alias=submit.pseven-htcondor&noUDP&sock=schedd_788_651f_11>
[pod/pseven-htcondormanager-deploy-556d67f945-kqcp9/htcondormanager] 2020-10-30T11:22:26.088432242Z condor_negotiator[52]: 0 seconds so far for this submitter
[pod/pseven-htcondormanager-deploy-556d67f945-kqcp9/htcondormanager] 2020-10-30T11:22:26.088436129Z condor_negotiator[52]: 0 seconds so far for this schedd
[pod/pseven-htcondormanager-deploy-556d67f945-kqcp9/htcondormanager] 2020-10-30T11:22:29.745601793Z condor_collector[51]: CCB: rejecting request from SCHEDD <10.244.0.130:19618?addrs=10.244.0.130-19618&alias=submit.pseven-htcondor&noUDP&sock=schedd_787_b587_3> on <10.244.0.130:40449> for ccbid 61 because no daemon is currently registered with that id (perhaps it recently disconnected).
[pod/pseven-htcondormanager-deploy-556d67f945-kqcp9/htcondormanager] 2020-10-30T11:22:33.790447523Z condor_collector[51]: CCB: rejecting request from SCHEDD <10.244.0.181:19618?addrs=10.244.0.181-19618&alias=submit.pseven-htcondor&noUDP&sock=schedd_787_b587_3> on <10.244.0.181:33425> for ccbid 312 because no daemon is currently registered with that id (perhaps it recently disconnected).
[pod/pseven-htcondormanager-deploy-556d67f945-kqcp9/htcondormanager] 2020-10-30T11:22:53.938863871Z condor_collector[51]: DC_AUTHENTICATE: authentication of <10.244.0.194:60436> was successful but resulted in a limited authorization which did not include this command (2 UPDATE_MASTER_AD), so aborting.
[pod/pseven-htcondormanager-deploy-556d67f945-kqcp9/htcondormanager] 2020-10-30T11:22:53.938895982Z condor_collector[51]: DC_AUTHENTICATE: Command not authorized, done!

My first guess is that we have to tune the master security policies and allow the schedd to drop inactive slots from the collector. My second guess is that the schedd becomes unresponsive because it is constantly trying to reach the collector or the disconnected execute node, and this somehow blocks incoming requests. Maybe we need to tune some network timeouts in order to avoid this.

PS:

$CondorVersion: 8.9.6 Mar 19 2020 BuildID: Debian-8.9.6-1 PackageID: 8.9.6-1 Debian-8.9.6-1 $
$CondorPlatform: X86_64-Ubuntu_18.04 $

