
Re: [HTCondor-users] Sometimes jobs do not start due to timeout problems



Yes, this is a sort of race, but one that is relatively harmless: it slows things down a bit but doesn’t result in any long-term harm.

 

What’s happening is that the SCHEDD gets a match and contacts the STARTD to claim the slot, then goes back to doing other processing (perhaps handling a condor_q request, or doing some garbage collection in the job queue).

If it doesn’t get back around to continuing the conversation with the STARTD to complete the claim within 20 seconds, that claim is lost and the slot goes back to the idle state, to be matched again in the next negotiation cycle.

 

Adding D_FULLDEBUG to the STARTD’s log will slow the STARTD down somewhat, giving the SCHEDD a little more time to complete the claim, but the real problem here is that the SCHEDD is heavily loaded and not able to process all of the matches it gets the first time it gets them.
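
(For completeness, that workaround is just a line like the following in the execute node’s local config, followed by a condor_reconfig on that machine. As noted, it only papers over the timing problem rather than fixing it.)

STARTD_DEBUG = D_FULLDEBUG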

 

You could treat this as just part of doing business, or you could look into what is slowing down the SCHEDD and try to address that.

 

Try running

 

condor_status -schedd -direct <schedd-name> -long -statistics ALL:2 > schedd.ad

 

Have a look at the RecentDaemonCoreDutyCycle attribute in schedd.ad. If it is more than about 0.95, then the SCHEDD is overloaded.

Also make sure that the daemon self-monitoring attributes (the ones that start with MonitorSelf) are relatively sane. In particular, is the SCHEDD using more memory than the machine has?
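
If you just want to spot-check those values without digging through the whole ad, something like this should work (-af/autoformat prints just the named attributes; the MonitorSelf* names here are from memory, so adjust them to whatever you actually see in schedd.ad):

condor_status -schedd -direct <schedd-name> -af Name RecentDaemonCoreDutyCycle MonitorSelfImageSize MonitorSelfResidentSetSize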

 

You can also look at the statistics in schedd.ad for operations that are taking more than a few seconds. Look for attributes that match *RuntimeMax that are > 5. These are operations that sometimes take more than 5 seconds, and during that time the SCHEDD will not be doing anything else (like finishing a claim).
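
If scanning schedd.ad by eye is tedious, a small script can pull those out. Here is a minimal sketch in Python; it assumes schedd.ad is the long-form output of the condor_status command above, one Attribute = Value pair per line, and the 5-second threshold is just the rule of thumb from the previous paragraph:

# list *RuntimeMax statistics in schedd.ad that exceed 5 seconds
import re

THRESHOLD = 5.0  # seconds; operations slower than this keep the SCHEDD from doing anything else

with open("schedd.ad") as f:
    for line in f:
        m = re.match(r"\s*(\w*RuntimeMax)\s*=\s*(\S+)", line)
        if not m:
            continue
        try:
            value = float(m.group(2))
        except ValueError:
            continue
        if value > THRESHOLD:
            print(m.group(1), "=", m.group(2))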

 

-tj

 

 

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Michael Schwarzfischer
Sent: Thursday, March 9, 2017 10:09 AM
To: htcondor-users@xxxxxxxxxxx
Subject: [HTCondor-users] Sometimes jobs do not start due to timeout problems

 

Hi everyone,

 

We are currently running HTCondor 8.4.9 on a Windows server with Windows 7 clients.

 

Occasionally some jobs do not start correctly due to timeout problems. We really don’t understand why or when this happens, and we hope you can help us find the problem.

It seems to happen less often when STARTD_DEBUG is set to D_FULLDEBUG on the client. Does this hint at some race condition?

 

What happens when it doesn’t work is the following:

-       The job is correctly matched, the client is set to Claimed, and the job is listed as running.

-       After 20 seconds a timeout occurs on a port that I could not find in any of the server log files.

-       Job is preempted/vacated

 

LogFile of Client StartLog:

03/09/17 11:23:11 slot1: Schedd addr = <10.10.11.116:49578?addrs=10.10.11.116-49578>

03/09/17 11:23:11 slot1: Alive interval = 300

03/09/17 11:23:11 slot1: Received ClaimId from schedd (<10.10.40.23:49295>#1489054871#1#...)

03/09/17 11:23:11 slot1: Rank of this claim is: 0.000000

03/09/17 11:23:11 slot1: Request accepted.

03/09/17 11:23:11 IPVERIFY: checking server.mydomain.net against 10.10.11.116

03/09/17 11:23:11 IPVERIFY: matched 10.10.11.116 to 10.10.11.116

03/09/17 11:23:11 IPVERIFY: ip found is 1

03/09/17 11:23:11 slot1: Remote owner is michael.name@xxxxxxxxxxxx

03/09/17 11:23:11 slot1: State change: claiming protocol successful

03/09/17 11:23:11 slot1: Changing state: Owner -> Claimed

03/09/17 11:23:11 slot1: Started ClaimLease timer (12) w/ 1800 second lease duration

03/09/17 11:23:31 condor_read(): timeout reading 5 bytes from <10.10.11.116:49768>.

03/09/17 11:23:31 IO: Failed to read packet header

03/09/17 11:23:31 Can't read ClaimId

03/09/17 11:23:31 Trying to update collector <10.10.11.116:9618>

03/09/17 11:23:31 Attempting to send update via TCP to collector <10.10.11.116:9618>

03/09/17 11:23:31 slot1: Sent update to 1 collector(s)

03/09/17 11:23:53 slot1: Called deactivate_claim()

03/09/17 11:23:53 slot1: State change: received RELEASE_CLAIM command

03/09/17 11:23:53 slot1: Changing state and activity: Claimed/Idle -> Preempting/Vacating

03/09/17 11:23:53 slot1: Canceled ClaimLease timer (12)

03/09/17 11:23:53 slot1: State change: No preempting claim, returning to owner

03/09/17 11:23:53 slot1: Changing state and activity: Preempting/Vacating -> Owner/Idle

03/09/17 11:23:57 Trying to update collector <10.10.11.116:9618>

03/09/17 11:23:57 Attempting to send update via TCP to collector <10.10.11.116:9618>

03/09/17 11:23:57 slot1: Sent update to 1 collector(s)

 

 

ShadowLog of server:

03/09/17 11:23:12 ******************************************************

03/09/17 11:23:12 ** condor_shadow (CONDOR_SHADOW) STARTING UP

03/09/17 11:23:12 ** C:\condor\bin\condor_shadow.exe

03/09/17 11:23:12 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)

03/09/17 11:23:12 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON

03/09/17 11:23:12 ** $CondorVersion: 8.4.9 Sep 29 2016 BuildID: 382747 $

03/09/17 11:23:12 ** $CondorPlatform: x86_64_Windows8 $

03/09/17 11:23:12 ** PID = 3612

03/09/17 11:23:12 ** Log last touched 3/9 10:36:14

03/09/17 11:23:12 ******************************************************

03/09/17 11:23:12 Using config source: \\server\Condor\config\condor_config

03/09/17 11:23:12 Using local config sources:

03/09/17 11:23:12    C:\QueueConfig\local\condor_config.local

03/09/17 11:23:12 config Macros = 109, Sorted = 109, StringBytes = 10628, TablesBytes = 896

03/09/17 11:23:12 CLASSAD_CACHING is OFF

03/09/17 11:23:12 Daemon Log is logging: D_ALWAYS D_ERROR

03/09/17 11:23:12 Daemoncore: Listening at <0.0.0.0:49761> on TCP (ReliSock) and UDP (SafeSock).

03/09/17 11:23:12 DaemonCore: command socket at <10.10.11.116:49761?addrs=10.10.11.116-49761>

03/09/17 11:23:12 DaemonCore: private command socket at <10.10.11.116:49761?addrs=10.10.11.116-49761>

03/09/17 11:23:12 Initializing a VANILLA shadow for job 900.0

03/09/17 11:23:53 (900.0) (3612): condor_write(): Socket closed when trying to write 4096 bytes to startd client01.mydomain.net, fd is 24

03/09/17 11:23:53 (900.0) (3612): Buf::write(): condor_write() failed

03/09/17 11:23:53 (900.0) (3612): client01.mydomain.net: DCStartd::activateClaim: Failed to send job ClassAd to the startd

03/09/17 11:23:53 (900.0) (3612): Job 900.0 is being evicted from client01.mydomain.net

03/09/17 11:23:53 (900.0) (3612): logEvictEvent with unknown reason (108), not logging.

03/09/17 11:23:53 (900.0) (3612): **** condor_shadow (condor_SHADOW) pid 3612 EXITING WITH STATUS 108

 

I am not sure where this port 49768 leads to. I could not find any log on the server listing this port number. I am not even sure whether I should search the ShadowLog or somewhere else. Should I increase the log level of any of the server daemons?

 

In old mails on this list I found two similar problems, both hinting at network problems. However, the fact that sometimes everything works smoothly makes me wonder whether this can really be a problem with our network. Furthermore, I am not sure how I would address that.

The problem happens similarly on several clients.

 

We are thankful for any comments or ideas!

 

Thanks,

Best,

Michael