
Re: [HTCondor-users] Jobs Becoming Idle SharedPortClient Error



The above content of the job.log confuses me. The job had clearly run for 20 seconds, so why was the job.log not updated to include the message that the job was executing on host xxxx?

Because the job never started. The job goes into the run state when the schedd forks its shadow, not when the starter actually starts the job. This usually doesn't matter much -- although it can if file transfer in is slow enough -- but in this case it's confusing.

As for the shadow log, if I recall correctly, the job exits the run state after file transfer out finishes -- it doesn't wait for the starter (or startd) to finish cleaning up the job sandbox. Therefore, HTCondor can try to start a job in a slot before it's been cleaned up from the previous job. Rather than wait indefinitely, the shadow gives up if the slot's not ready after twenty seconds, exiting with code 108, which the manual defines as JOB_NOT_STARTED.

The startd log is a little less useful -- the StarterLog for the given slot may have more information. At any rate, accounting for what appears to be an 8-second clock difference, the stories match up: it takes the starter 21 seconds to clean up the job directory, it accepts the job after the shadow restart, and then decides that the negotiator was wrong about the job actually matching, and kicks it back off.
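
To line up entries from two machines whose clocks disagree, it can help to shift one log's timestamps by the observed offset before comparing them. The following is only a rough sketch: it assumes every line starts with a timestamp in the same "MM/DD/YY HH:MM:SS" form as the error line quoted below, the function names are made up for illustration, and the sign of the 8-second offset is a guess that you would have to confirm from your own logs.

```python
from datetime import datetime, timedelta

# Timestamp layout as seen at the start of the quoted log line,
# e.g. "08/29/17 17:51:38". Assumed to be the same in every log compared.
HTCONDOR_TS_FORMAT = "%m/%d/%y %H:%M:%S"
TS_LENGTH = len("08/29/17 17:51:38")


def parse_log_timestamp(line):
    """Parse the leading timestamp of a log line (hypothetical helper)."""
    return datetime.strptime(line[:TS_LENGTH], HTCONDOR_TS_FORMAT)


def align(timestamp, skew_seconds):
    """Shift a timestamp by the clock offset between two machines.

    The sign of skew_seconds is an assumption; check which machine is ahead.
    """
    return timestamp + timedelta(seconds=skew_seconds)


sample = ("08/29/17 17:51:38 (fd:4) (pid:4060) (D_ALWAYS) "
          "ERROR: SharedPortClient: Failed to open named pipe")

local_time = parse_log_timestamp(sample)
# Apply the apparent 8-second difference before comparing with the other log.
aligned_time = align(local_time, 8)
print(local_time, "->", aligned_time)
```

Once both logs are on the same clock, the 20-second shadow timeout and the starter's cleanup time are easier to match against each other.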

08/29/17 17:51:38 (fd:4) (pid:4060) (D_ALWAYS) ERROR: SharedPortClient: Failed to open named pipe id '1904_30e0_4' as requested by STARTD <10.122.227.253:9618?addrs=10.122.227.253-9618&noUDP&sock=3696_5e98_3> on <10.122.227.253:56884> for sending socket: 2 The system cannot find the file specified.

This probably just means that someone tried to contact a starter after it had killed itself. You should be able to find the named pipe id elsewhere in the HTCondor logs of the machine that produced this error. It will show up after the string 'sock='; the third line quoted above is an example.
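
If the logs are large, a small script can do that search. This is a minimal sketch, not an HTCondor tool: the helper names are invented, the regular expressions are based only on the one error line quoted above, and the second sample string is a made-up placeholder rather than real log output.

```python
import re

# Patterns inferred from the single quoted error line; treat them as assumptions.
PIPE_ID_RE = re.compile(r"named pipe id '([^']+)'")
SOCK_RE = re.compile(r"sock=([0-9A-Za-z_]+)")


def pipe_id_from_error(line):
    """Return the named pipe id from a SharedPortClient error line, or None."""
    match = PIPE_ID_RE.search(line)
    return match.group(1) if match else None


def sock_ids(line):
    """Return every id that follows 'sock=' in a line."""
    return SOCK_RE.findall(line)


def lines_mentioning(pipe_id, lines):
    """Return the lines in which 'sock=<pipe_id>' appears as a whole id."""
    pattern = re.compile(r"sock=" + re.escape(pipe_id) + r"(?![0-9A-Za-z_])")
    return [line for line in lines if pattern.search(line)]


error_line = (
    "08/29/17 17:51:38 (fd:4) (pid:4060) (D_ALWAYS) ERROR: SharedPortClient: "
    "Failed to open named pipe id '1904_30e0_4' as requested by STARTD "
    "<10.122.227.253:9618?addrs=10.122.227.253-9618&noUDP&sock=3696_5e98_3> "
    "on <10.122.227.253:56884> for sending socket: 2"
)
# Placeholder text for illustration only; not a real HTCondor log line.
placeholder_line = "EXAMPLE ONLY <10.122.227.253:9618?sock=1904_30e0_4>"

wanted = pipe_id_from_error(error_line)
for hit in lines_mentioning(wanted, [error_line, placeholder_line]):
    print(hit)
```

In practice you would feed it the lines of the StarterLog, StartLog, and so on from the execute machine, and the daemon that logged the matching 'sock=' address is the one that owned the pipe.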

The SharedPortClient error appeared to occur around the 20-second mark, when the job got evicted again.

The starter that's trying to clean up may finish, give up, or be killed at this point. (The startd should try to finish cleaning up if the starter doesn't exit cleanly.)

Maybe this is somehow related to how the execute machine is being shared between multiple central managers?

That's more likely to cause weird timing errors. The other thing that may be worth checking is whether the slot was matched by more than one negotiator. (Check the match log of all your central managers.) I don't know what that would look like in the logs, or to the job, but it's an inevitable part of reporting to more than one CM.

- ToddM