[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[HTCondor-users] Sometimes jobs do not start due to timeout problems



Hi everyone,

 

We are currently running condor 8.4.9. on a Windows server with Windows 7 clients.

 

It occasionally happens that some jobs may not be correctly started due to timeout problems. We really don’t understand why and when this happens and we hope that you could help us finding the problem.

It seems to happen less often when setting STARTD_DEBUG to FULLDEBUG level on the client. Does this hint to some race condition?

 

So what happens if it doesn’t work is the following:

-       The job is correctly matched, the client is set to claimed. The job is listed as running.

-       After 20 seconds a timeout happens to a port, which I did not find in any of the server log files.

-       Job is preempted/vacated

 

LogFile of Client StartLog:

03/09/17 11:23:11 slot1: Schedd addr = <10.10.11.116:49578?addrs=10.10.11.116-49578>

03/09/17 11:23:11 slot1: Alive interval = 300

03/09/17 11:23:11 slot1: Received ClaimId from schedd (<10.10.40.23:49295>#1489054871#1#...)

03/09/17 11:23:11 slot1: Rank of this claim is: 0.000000

03/09/17 11:23:11 slot1: Request accepted.

03/09/17 11:23:11 IPVERIFY: checking server.mydomain.net against 10.10.11.116

03/09/17 11:23:11 IPVERIFY: matched 10.10.11.116 to 10.10.11.116

03/09/17 11:23:11 IPVERIFY: ip found is 1

03/09/17 11:23:11 slot1: Remote owner is michael.name@xxxxxxxxxxxx

03/09/17 11:23:11 slot1: State change: claiming protocol successful

03/09/17 11:23:11 slot1: Changing state: Owner -> Claimed

03/09/17 11:23:11 slot1: Started ClaimLease timer (12) w/ 1800 second lease duration

03/09/17 11:23:31 condor_read(): timeout reading 5 bytes from <10.10.11.116:49768>.

03/09/17 11:23:31 IO: Failed to read packet header

03/09/17 11:23:31 Can't read ClaimId

03/09/17 11:23:31 Trying to update collector <10.10.11.116:9618>

03/09/17 11:23:31 Attempting to send update via TCP to collector <10.10.11.116:9618>

03/09/17 11:23:31 slot1: Sent update to 1 collector(s)

03/09/17 11:23:53 slot1: Called deactivate_claim()

03/09/17 11:23:53 slot1: State change: received RELEASE_CLAIM command

03/09/17 11:23:53 slot1: Changing state and activity: Claimed/Idle -> Preempting/Vacating

03/09/17 11:23:53 slot1: Canceled ClaimLease timer (12)

03/09/17 11:23:53 slot1: State change: No preempting claim, returning to owner

03/09/17 11:23:53 slot1: Changing state and activity: Preempting/Vacating -> Owner/Idle

03/09/17 11:23:57 Trying to update collector <10.10.11.116:9618>

03/09/17 11:23:57 Attempting to send update via TCP to collector <10.10.11.116:9618>

03/09/17 11:23:57 slot1: Sent update to 1 collector(s)

 

 

ShadowLog of server:

03/09/17 11:23:12 ******************************************************

03/09/17 11:23:12 ** condor_shadow (CONDOR_SHADOW) STARTING UP

03/09/17 11:23:12 ** C:\condor\bin\condor_shadow.exe

03/09/17 11:23:12 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)

03/09/17 11:23:12 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON

03/09/17 11:23:12 ** $CondorVersion: 8.4.9 Sep 29 2016 BuildID: 382747 $

03/09/17 11:23:12 ** $CondorPlatform: x86_64_Windows8 $

03/09/17 11:23:12 ** PID = 3612

03/09/17 11:23:12 ** Log last touched 3/9 10:36:14

03/09/17 11:23:12 ******************************************************

03/09/17 11:23:12 Using config source: \\server\Condor\config\condor_config

03/09/17 11:23:12 Using local config sources:

03/09/17 11:23:12    C:\QueueConfig\local\condor_config.local

03/09/17 11:23:12 config Macros = 109, Sorted = 109, StringBytes = 10628, TablesBytes = 896

03/09/17 11:23:12 CLASSAD_CACHING is OFF

03/09/17 11:23:12 Daemon Log is logging: D_ALWAYS D_ERROR

03/09/17 11:23:12 Daemoncore: Listening at <0.0.0.0:49761> on TCP (ReliSock) and UDP (SafeSock).

03/09/17 11:23:12 DaemonCore: command socket at <10.10.11.116:49761?addrs=10.10.11.116-49761>

03/09/17 11:23:12 DaemonCore: private command socket at <10.10.11.116:49761?addrs=10.10.11.116-49761>

03/09/17 11:23:12 Initializing a VANILLA shadow for job 900.0

03/09/17 11:23:53 (900.0) (3612): condor_write(): Socket closed when trying to write 4096 bytes to startd client01.mydomain.net, fd is 24

03/09/17 11:23:53 (900.0) (3612): Buf::write(): condor_write() failed

03/09/17 11:23:53 (900.0) (3612): client01.mydomain.net: DCStartd::activateClaim: Failed to send job ClassAd to the startd

03/09/17 11:23:53 (900.0) (3612): Job 900.0 is being evicted from client01.mydomain.net

03/09/17 11:23:53 (900.0) (3612): logEvictEvent with unknown reason (108), not logging.

03/09/17 11:23:53 (900.0) (3612): **** condor_shadow (condor_SHADOW) pid 3612 EXITING WITH STATUS 108

 

I am not sure where this port 49768 is heading to? I could not find any log on the server listing this port number. I am not even sure if I should search in ShadowLog or somewhere else? Should I increase log level of any of the server daemons?

 

In the old mails of your list I found two similar problems all hinting to some network problems. However, the fact that sometimes everything just works smoothly makes me wonder if this could really be a problem of our network? Furthermore, I am not sure how to address this?

This problem happens on several clients similarly.

 

We are thankful for any comments or ideas!

 

Thanks,

Best,

Michael