Re: [Condor-users] errno = 10054 failure



On 6/29/06, DeVoil, Peter <Peter.DeVoil@xxxxxxxxxxxxxx> wrote:
Hi folks,

I am having frequent starter failures on Windows XP boxes:

6/29 10:15:23 Submitting machine is "ODIN"
6/29 10:15:23 condor_read(): recv() returned -1, errno = 10054, assuming
failure.
6/29 10:15:23 IO: Failed to read packet header
6/29 10:15:23 File transfer failed (status=0).
6/29 10:15:23 ERROR "Failed to transfer files" at line 1219 in file
..\src\condor_starter.V6.1\jic_shadow.C
6/29 10:15:23 ShutdownFast all jobs.

After this, the VM stays unused until I kill the job.

Any ideas/guesses what "errno = 10054" may mean?

I find this a useful resource for a first lookup:

http://help.netop.com/support/errorcodes/win32_error_codes.htm
For example, for this code it gives:
10054 An existing connection was forcibly closed by the remote host.
WSAECONNRESET
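
If you'd rather not go to the web page each time, a small lookup table does the same job. This is only a sketch in Python: the function name is made up for illustration, and the table covers just a handful of common Winsock codes. The 10054 text is the one quoted above; the other descriptions are my own short paraphrases.

```python
# Hypothetical helper: map a few common Winsock error numbers to their
# symbolic names and a short description. Not an exhaustive table.
WINSOCK_ERRORS = {
    10053: ("WSAECONNABORTED", "Connection aborted by software on the local host."),
    10054: ("WSAECONNRESET",
            "An existing connection was forcibly closed by the remote host."),
    10060: ("WSAETIMEDOUT", "Connection attempt or established connection timed out."),
    10061: ("WSAECONNREFUSED", "Target machine actively refused the connection."),
}


def describe_winsock_error(code):
    """Return 'NAME: description' for a known code, or a fallback string."""
    entry = WINSOCK_ERRORS.get(code)
    if entry is None:
        return "unknown Winsock error %d" % code
    name, text = entry
    return "%s: %s" % (name, text)


print(describe_winsock_error(10054))
```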

This doesn't tell you much, apart from the fact that the starter doesn't
seem to have dropped the connection; the schedd/shadow did. Since you were
transferring files, I would look at the shadow log, the schedd log, then the
master log on the schedd machine (the one to which you submitted the job)
around that time and send any errors to the list.
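
To save some scrolling, something like the following could pull out the lines near the failure time from each of those logs. It is a rough sketch, not a Condor tool: the function names are invented, it assumes every log uses the same "M/D HH:MM:SS" prefix as the starter excerpt above, and because that prefix has no year the year is passed in by hand.

```python
import re
from datetime import datetime, timedelta

# Matches the "M/D HH:MM:SS " prefix seen in the starter log excerpt.
STAMP = re.compile(r"^(\d{1,2})/(\d{1,2}) (\d{1,2}):(\d{2}):(\d{2}) ")


def parse_stamp(line, year=2006):
    """Return the line's timestamp, or None if it has no timestamp prefix."""
    m = STAMP.match(line)
    if m is None:
        return None
    month, day, hour, minute, second = (int(g) for g in m.groups())
    return datetime(year, month, day, hour, minute, second)


def lines_near(lines, when, window_seconds=30):
    """Return the timestamped lines within window_seconds of `when`."""
    window = timedelta(seconds=window_seconds)
    hits = []
    for line in lines:
        stamp = parse_stamp(line)
        # Wrapped continuation lines carry no timestamp and are skipped.
        if stamp is not None and abs(stamp - when) <= window:
            hits.append(line.rstrip("\n"))
    return hits
```

You would open each log file on the schedd machine in turn and pass the file object as `lines`, with `when` set to `datetime(2006, 6, 29, 10, 15, 23)` for the failure shown above. Keep in mind that the clocks on the two machines may not agree exactly, so a wider window may be needed.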

I find an error of this kind is normally due to some aspect of the
file transfer going wrong, after which the link is simply dropped
without any attempt to communicate why it went wrong. (This is a pain,
but the Condor guys are aware of it and are putting in some better
logging so you don't have to jump back and forth between logs so much.)
Matt