Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [HTCondor-users] Jobs Becoming Idle SharedPortClient Error

Date: Wed, 30 Aug 2017 11:51:49 -0500 (CDT)
From: Todd L Miller <tlmiller@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Jobs Becoming Idle SharedPortClient Error

The above content of the job.log confuses me, clearly the job had run for 20 seconds, why had the job.log not been updated to include the message that the job was executing on host xxxx?

Because the job never started. The job goes into the run state when the schedd forks its shadow, not when the starter actually starts the job. This usually doesn't matter much -- although it can if file transfer in is slow enough -- but in this case it's confusing.

As for the shadow log, if I recall correctly, the job exits the run state after file transfer out finishes -- it doesn't wait for the starter (or startd) to finish cleaning up the job sandbox. Therefore, HTCondor can try to start a job in a slot before it's been cleaned up from the previous job. Rather than wait indefinitely, the shadow gives up if the slot's not ready after twenty seconds, exiting with code 108, which the manual defines as JOB_NOT_STARTED.
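If you want to see how often this is happening, something like the following could pick the relevant shadow exits out of a ShadowLog. It is only a sketch: the phrase "EXITING WITH STATUS" is my assumption about how the shadow words its exit line, and the function name is made up for illustration, so check a real log before relying on it.

```python
import re

JOB_NOT_STARTED = 108  # exit code named in the message above

# Assumed wording of the shadow's exit line; adjust to match your ShadowLog.
EXIT_LINE = re.compile(r"EXITING WITH STATUS (\d+)")

def not_started_exits(log_lines):
    """Return the log lines that report a shadow exit with code 108."""
    hits = []
    for line in log_lines:
        match = EXIT_LINE.search(line)
        if match and int(match.group(1)) == JOB_NOT_STARTED:
            hits.append(line)
    return hits
```

Comparing the timestamps of those lines against the StarterLog for the slot should show whether each one lines up with a sandbox cleanup that was still in progress.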

The startd log is a little less useful -- the StarterLog for the given slot may have more information. At any rate, accounting for what appears to be an 8-second clock difference, the stories match up: it takes the starter 21 seconds to clean up the job directory, it accepts the job after the shadow restart, and then decides that the negotiator was wrong about the job actually matching, and kicks it back off.
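When comparing logs from two machines whose clocks disagree, it can help to shift one side's timestamps before reading them side by side. A minimal sketch, assuming each log line begins with a timestamp in the "MM/DD/YY HH:MM:SS" form seen in this thread; the function names are hypothetical, and the sign of the offset depends on which machine is ahead, which you'd have to work out from your own logs.

```python
from datetime import datetime, timedelta

STAMP_FORMAT = "%m/%d/%y %H:%M:%S"  # e.g. "08/29/17 17:51:38"
STAMP_LENGTH = 17                   # length of a timestamp in that format

def parse_stamp(line):
    """Parse the leading timestamp of a log line."""
    return datetime.strptime(line[:STAMP_LENGTH], STAMP_FORMAT)

def align(line, offset_seconds):
    """Return the line's timestamp shifted by the given clock offset."""
    return parse_stamp(line) + timedelta(seconds=offset_seconds)
```

For example, align(line, -8) moves a line from the machine that is 8 seconds ahead back onto the other machine's clock.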

08/29/17 17:51:38 (fd:4) (pid:4060) (D_ALWAYS) ERROR: SharedPortClient: Failed to open named pipe id '1904_30e0_4' as requested by STARTD <10.122.227.253:9618?addrs=10.122.227.253-9618&noUDP&sock=3696_5e98_3> on <10.122.227.253:56884> for sending socket: 2 The system cannot find the file specified.


	This probably just means that someone tried to contact a starter after it had killed itself. You should be able to find the named pipe id elsewhere in the HTCondor logs of the machine that produced this error; it will show up after the string 'sock='; the third line quoted above is an example.
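That lookup can be scripted if the logs are large. The following is a rough sketch rather than anything HTCondor provides: it pulls the pipe id out of the error message and then keeps the log lines whose address string contains 'sock=' followed by that id. The function names are made up for illustration.

```python
import re

# Matches the pipe id in the SharedPortClient error message quoted above.
PIPE_ID = re.compile(r"named pipe id '([^']+)'")

def find_pipe_id(error_line):
    """Return the named pipe id from a SharedPortClient error line, or None."""
    match = PIPE_ID.search(error_line)
    return match.group(1) if match else None

def lines_mentioning_sock(log_lines, pipe_id):
    """Return the log lines that contain sock=<pipe_id> exactly."""
    # The lookahead stops a shorter id from matching inside a longer one.
    needle = re.compile(r"sock=" + re.escape(pipe_id) + r"(?![0-9A-Za-z_])")
    return [line for line in log_lines if needle.search(line)]
```

Running find_pipe_id on the error line above gives '1904_30e0_4'; feeding that id and the lines of the other daemon logs on the same machine to lines_mentioning_sock should show which daemon owned the pipe.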

The SharedPortClient error appeared to occur around the 20 second mark for when the job got evicted again.

The starter that's trying to clean up may finish, give up, or be killed at this point. (The startd should try to finish cleaning up if the starter doesn't exit cleanly.)

Maybe this is somehow related to how the execute machine is being shared between multiple central managers?

That's more likely to cause weird timing errors. The other thing that may be worth checking is whether the slot was matched by more than one negotiator. (Check the match log of all your central managers.) I don't know what that would look like in the logs, or to the job, but it's an inevitable part of reporting to more than one CM.


- ToddM
