
[HTCondor-users] Jobs Becoming Idle SharedPortClient Error



Hello, we are using HTCondor 8.6.1 and have started seeing some strange issues with certain jobs.  By way of background, the errors occur on a Windows machine that belongs to multiple condor pools, meaning that the CONDOR_HOST field in its condor config contains a comma-delimited list of the IPs of each central manager the machine reports to.  The issue we are seeing is not deterministic, but it is easily reproducible by keeping a queue of jobs for this Windows machine on the scheduler.
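
For concreteness, the relevant piece of that machine's condor config looks roughly like the sketch below; the addresses are placeholders, not our actual central managers:

# Sketch of the Windows execute node's condor_config (placeholder addresses).
# CONDOR_HOST lists every central manager this node reports to.
CONDOR_HOST = 10.0.0.10, 10.0.0.20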


So, for some reason we will see the following in the condor queue:

ubuntu@htcondorscheduler1:/disk-root/condor/log$ sudo condor_q

186.0   condor          8/24 18:38   0+00:00:21 I  0     0.0 xxxxxxxx


This job typically runs for ~5 minutes, so it was odd that the job had gone into an idle state after 20 seconds.  Checking the job.log:

000 (186.000.000) 08/24 18:38:47 Job submitted from host: <10.123.25.101:9618?addrs=10.123.25.101-9618+[--1]-9618&noUDP&sock=1156_2061_3>
    DAG Node: stage_1100
...


The above content of the job.log confuses me: the job had clearly run for roughly 20 seconds, so why was the job.log never updated with the event stating that the job was executing on host xxxx?  I checked the ShadowLog on the scheduling machine and found the following:


08/24/17 18:49:07 (?.?) (4906): Initializing a VANILLA shadow for job 186.0
08/24/17 18:49:07 (186.0) (4906): Request to run on tlp1 <10.123.25.125:9618?addrs=10.123.25.125-9618&noUDP&sock=2816_4824_3> was DELAYED (previous job still being vacated)
08/24/17 18:49:08 (186.0) (4906): Request to run on tlp1 <10.123.25.125:9618?addrs=10.123.25.125-9618&noUDP&sock=2816_4824_3> was DELAYED (previous job still being vacated)
08/24/17 18:49:09 (186.0) (4906): Request to run on tlp1 <10.123.25.125:9618?addrs=10.123.25.125-9618&noUDP&sock=2816_4824_3> was DELAYED (previous job still being vacated)
08/24/17 18:49:10 (186.0) (4906): Request to run on tlp1 <10.123.25.125:9618?addrs=10.123.25.125-9618&noUDP&sock=2816_4824_3> was DELAYED (previous job still being vacated)
08/24/17 18:49:11 (186.0) (4906): Request to run on tlp1 <10.123.25.125:9618?addrs=10.123.25.125-9618&noUDP&sock=2816_4824_3> was DELAYED (previous job still being vacated)
08/24/17 18:49:12 (186.0) (4906): Request to run on tlp1 <10.123.25.125:9618?addrs=10.123.25.125-9618&noUDP&sock=2816_4824_3> was DELAYED (previous job still being vacated)
08/24/17 18:49:13 (186.0) (4906): Request to run on tlp1 <10.123.25.125:9618?addrs=10.123.25.125-9618&noUDP&sock=2816_4824_3> was DELAYED (previous job still being vacated)
08/24/17 18:49:14 (186.0) (4906): Request to run on tlp1 <10.123.25.125:9618?addrs=10.123.25.125-9618&noUDP&sock=2816_4824_3> was DELAYED (previous job still being vacated)
08/24/17 18:49:15 (186.0) (4906): Request to run on tlp1 <10.123.25.125:9618?addrs=10.123.25.125-9618&noUDP&sock=2816_4824_3> was DELAYED (previous job still being vacated)
08/24/17 18:49:16 (186.0) (4906): Request to run on tlp1 <10.123.25.125:9618?addrs=10.123.25.125-9618&noUDP&sock=2816_4824_3> was DELAYED (previous job still being vacated)
08/24/17 18:49:17 (186.0) (4906): Request to run on tlp1 <10.123.25.125:9618?addrs=10.123.25.125-9618&noUDP&sock=2816_4824_3> was DELAYED (previous job still being vacated)
08/24/17 18:49:18 (186.0) (4906): Request to run on tlp1 <10.123.25.125:9618?addrs=10.123.25.125-9618&noUDP&sock=2816_4824_3> was DELAYED (previous job still being vacated)
08/24/17 18:49:19 (186.0) (4906): Request to run on tlp1 <10.123.25.125:9618?addrs=10.123.25.125-9618&noUDP&sock=2816_4824_3> was DELAYED (previous job still being vacated)
08/24/17 18:49:20 (186.0) (4906): Request to run on tlp1 <10.123.25.125:9618?addrs=10.123.25.125-9618&noUDP&sock=2816_4824_3> was DELAYED (previous job still being vacated)
08/24/17 18:49:21 (186.0) (4906): Request to run on tlp1 <10.123.25.125:9618?addrs=10.123.25.125-9618&noUDP&sock=2816_4824_3> was DELAYED (previous job still being vacated)
08/24/17 18:49:22 (186.0) (4906): Request to run on tlp1 <10.123.25.125:9618?addrs=10.123.25.125-9618&noUDP&sock=2816_4824_3> was DELAYED (previous job still being vacated)
08/24/17 18:49:23 (186.0) (4906): Request to run on tlp1 <10.123.25.125:9618?addrs=10.123.25.125-9618&noUDP&sock=2816_4824_3> was DELAYED (previous job still being vacated)
08/24/17 18:49:24 (186.0) (4906): Request to run on tlp1 <10.123.25.125:9618?addrs=10.123.25.125-9618&noUDP&sock=2816_4824_3> was DELAYED (previous job still being vacated)
08/24/17 18:49:25 (186.0) (4906): Request to run on tlp1 <10.123.25.125:9618?addrs=10.123.25.125-9618&noUDP&sock=2816_4824_3> was DELAYED (previous job still being vacated)
08/24/17 18:49:26 (186.0) (4906): Request to run on tlp1 <10.123.25.125:9618?addrs=10.123.25.125-9618&noUDP&sock=2816_4824_3> was DELAYED (previous job still being vacated)
08/24/17 18:49:27 (186.0) (4906): Request to run on tlp1 <10.123.25.125:9618?addrs=10.123.25.125-9618&noUDP&sock=2816_4824_3> was DELAYED (previous job still being vacated)
08/24/17 18:49:27 (186.0) (4906): activateClaim(): Too many retries, giving up.
08/24/17 18:49:27 (186.0) (4906): Job 186.0 is being evicted from tlp1
08/24/17 18:49:27 (186.0) (4906): logEvictEvent with unknown reason (108), not logging.
08/24/17 18:49:27 (186.0) (4906): **** condor_shadow (condor_SHADOW) pid 4906 EXITING WITH STATUS 108


Followed by:

08/24/17 18:49:42 Initializing a VANILLA shadow for job 186.0
08/24/17 18:49:42 (186.0) (7109): Request to run on tlp1 <10.123.25.125:9618?addrs=10.123.25.125-9618&noUDP&sock=2816_4824_3> was REFUSED
08/24/17 18:49:42 (186.0) (7109): Job 186.0 is being evicted from tlp1
08/24/17 18:49:42 (186.0) (7109): logEvictEvent with unknown reason (108), not logging.
08/24/17 18:49:42 (186.0) (7109): **** condor_shadow (condor_SHADOW) pid 7109 EXITING WITH STATUS 108


The execute machine had the following within its StartLog:

08/24/17 18:44:37 slot1: Changing activity: Idle -> Busy
08/24/17 18:48:59 slot1: Called deactivate_claim_forcibly()
08/24/17 18:48:59 slot1: Got activate claim while starter is still alive.
08/24/17 18:48:59 slot1: Telling shadow to try again later.
08/24/17 18:49:00 slot1: Got activate claim while starter is still alive.
08/24/17 18:49:00 slot1: Telling shadow to try again later.
08/24/17 18:49:01 slot1: Got activate claim while starter is still alive.
08/24/17 18:49:01 slot1: Telling shadow to try again later.
08/24/17 18:49:02 slot1: Got activate claim while starter is still alive.
08/24/17 18:49:02 slot1: Telling shadow to try again later.
08/24/17 18:49:03 slot1: Got activate claim while starter is still alive.
08/24/17 18:49:03 slot1: Telling shadow to try again later.
08/24/17 18:49:04 slot1: Got activate claim while starter is still alive.
08/24/17 18:49:04 slot1: Telling shadow to try again later.
08/24/17 18:49:06 slot1: Got activate claim while starter is still alive.
08/24/17 18:49:06 slot1: Telling shadow to try again later.
08/24/17 18:49:07 slot1: Got activate claim while starter is still alive.
08/24/17 18:49:07 slot1: Telling shadow to try again later.
08/24/17 18:49:08 slot1: Got activate claim while starter is still alive.
08/24/17 18:49:08 slot1: Telling shadow to try again later.
08/24/17 18:49:09 slot1: Got activate claim while starter is still alive.
08/24/17 18:49:09 slot1: Telling shadow to try again later.
08/24/17 18:49:10 slot1: Got activate claim while starter is still alive.
08/24/17 18:49:10 slot1: Telling shadow to try again later.
08/24/17 18:49:11 slot1: Got activate claim while starter is still alive.
08/24/17 18:49:11 slot1: Telling shadow to try again later.
08/24/17 18:49:12 slot1: Got activate claim while starter is still alive.
08/24/17 18:49:12 slot1: Telling shadow to try again later.
08/24/17 18:49:13 slot1: Got activate claim while starter is still alive.
08/24/17 18:49:13 slot1: Telling shadow to try again later.
08/24/17 18:49:14 slot1: Got activate claim while starter is still alive.
08/24/17 18:49:14 slot1: Telling shadow to try again later.
08/24/17 18:49:15 slot1: Got activate claim while starter is still alive.
08/24/17 18:49:15 slot1: Telling shadow to try again later.
08/24/17 18:49:16 slot1: Got activate claim while starter is still alive.
08/24/17 18:49:16 slot1: Telling shadow to try again later.
08/24/17 18:49:17 slot1: Got activate claim while starter is still alive.
08/24/17 18:49:17 slot1: Telling shadow to try again later.
08/24/17 18:49:18 slot1: Got activate claim while starter is still alive.
08/24/17 18:49:18 slot1: Telling shadow to try again later.
08/24/17 18:49:19 slot1: Got activate claim while starter is still alive.
08/24/17 18:49:19 slot1: Telling shadow to try again later.
08/24/17 18:49:20 slot1: Got activate claim while starter is still alive.
08/24/17 18:49:20 slot1: Telling shadow to try again later.
08/24/17 18:49:20 slot1: Called deactivate_claim()
08/24/17 18:49:20 slot1: State change: received RELEASE_CLAIM command
08/24/17 18:49:20 slot1: Changing state and activity: Claimed/Busy -> Preempting/Vacating
08/24/17 18:49:20 Starter pid 3936 exited with status 0
08/24/17 18:49:20 slot1: State change: starter exited
08/24/17 18:49:20 slot1: State change: No preempting claim, returning to owner
08/24/17 18:49:20 slot1: Changing state and activity: Preempting/Vacating -> Owner/Idle
08/24/17 18:49:20 slot1: State change: IS_OWNER is false
08/24/17 18:49:20 slot1: Changing state: Owner -> Unclaimed
08/24/17 18:49:34 slot1: Request accepted.
08/24/17 18:49:34 slot1: Remote owner is condor@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
08/24/17 18:49:34 slot1: State change: claiming protocol successful
08/24/17 18:49:34 slot1: Changing state: Unclaimed -> Claimed
08/24/17 18:49:34 slot1: Got activate_claim request from shadow (10.123.25.101)
08/24/17 18:49:34 slot1: Job Requirements check failed!
08/24/17 18:49:34 slot1: Called deactivate_claim()
08/24/17 18:49:34 slot1: State change: received RELEASE_CLAIM command
08/24/17 18:49:34 slot1: Changing state and activity: Claimed/Idle -> Preempting/Vacating
08/24/17 18:49:34 slot1: State change: No preempting claim, returning to owner
08/24/17 18:49:34 slot1: Changing state and activity: Preempting/Vacating -> Owner/Idle
08/24/17 18:49:34 slot1: State change: IS_OWNER is false
08/24/17 18:49:34 slot1: Changing state: Owner -> Unclaimed
08/24/17 18:53:30 slot1: Request accepted.


And the following on the Central Manager within the MatchLog:

08/24/17 18:49:42       Matched 186.0 condor@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx <10.123.25.101:9618?addrs=10.123.25.101-9618+[--1]-9618&noUDP&sock=1156_2061_3> preempting none <10.123.25.125:9618?addrs=10.123.25.125-9618&noUDP&sock=2816_4824_3> tlp1


So there is evidence that the job was actually matched to the execute machine, though the job.log never reflected this information…


After observing the errors above, I turned on D_ALL debugging for COLLECTOR_DEBUG, MASTER_DEBUG, SCHEDD_DEBUG, SHADOW_DEBUG, STARTD_DEBUG, SHARED_PORT_DEBUG, and STARTER_DEBUG to see whether I could get any additional information that might reveal the cause.  It appears that the following error message, observed on subsequent attempts to reproduce the issue, may be related:


08/29/17 17:51:38 (fd:4) (pid:4060) (D_COMMAND) Calling HandleReq <SharedPortServer::HandleConnectRequest> (0) for command 75 (SHARED_PORT_CONNECT) from unauthenticated@unmapped <10.122.227.253:56884>
08/29/17 17:51:38 (fd:4) (pid:4060) (D_ALWAYS) ERROR: SharedPortClient: Failed to open named pipe id '1904_30e0_4'  as requested by STARTD <10.122.227.253:9618?addrs=10.122.227.253-9618&noUDP&sock=3696_5e98_3> on <10.122.227.253:56884> for sending socket: 2 The system cannot find the file specified.
08/29/17 17:51:38 (fd:4) (pid:4060) (D_COMMAND) Return from HandleReq <SharedPortServer::HandleConnectRequest> (handler: 0.000125s, sec: 0.001s, payload: 0.000s)
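
For reference, the extra debug levels mentioned above were turned on along these lines in the local config; this is a sketch, and the exact values in our real config may differ slightly:

COLLECTOR_DEBUG = D_ALL
MASTER_DEBUG = D_ALL
SCHEDD_DEBUG = D_ALL
SHADOW_DEBUG = D_ALL
STARTD_DEBUG = D_ALL
SHARED_PORT_DEBUG = D_ALL
STARTER_DEBUG = D_ALL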


The SharedPortClient error appeared around the 20-second mark, when the job was evicted again.  There is no authentication done between the machines within our pools, so I don't believe this is related to that.  The instance is hosted on EC2 and there isn't any other software that should be occupying the ports used by condor, so I am not sure why this error is occurring.  Maybe this is somehow related to how the execute machine is being shared between multiple central managers?  The b