Re: [Condor-users] Error from starter, jobs put on hold

Michael Hess wrote:

Every day, around 20 jobs from our central submitter are put on hold.
condor_q -l says:

LastHoldReason = "Error from starter on pc-name.ourlocalnetwork.plymouth.ac.uk:
STARTER failed to receive file(s) from <x.x.x.x:19086> Download acknowledgment
missing attribute: Result"
LastHoldReasonCode = 11
LastHoldReasonSubCode = 0

This error is rather unexpected. Todd's suggestion of using PeriodicRelease will let you work around the problem, but I am very curious what is causing it. Honestly, the only thing I can think of that would cause this to happen sporadically, as you describe, is some kind of memory or network data corruption. However, it may also be that we are not correctly detecting some other "normal" network error, such as closing of the socket while the final transmission is in progress. Either way, I'd like to figure out what is going on.
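
For reference, a minimal sketch of what such a workaround might look like in the submit description file. The exact expression is an assumption on my part, not necessarily what Todd proposed: it releases a job only when the hold reason code matches the one reported above (11), and only after the job has been on hold for five minutes. The 300-second delay is an arbitrary illustrative value.

```
# Sketch only; the condition and the 300-second delay are assumptions.
# While a job is held, HoldReasonCode carries the code that later shows
# up as LastHoldReasonCode after release (11 in the output above).
# EnteredCurrentStatus is the time the job entered the held state.
periodic_release = (HoldReasonCode == 11) && ((CurrentTime - EnteredCurrentStatus) > 300)
```

Restricting the expression to this one hold reason keeps jobs that were held for other reasons (for example, by the user) from being released automatically.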

When 6.8.4 comes out, there will be some extra information in the StarterLog accompanying this error message. This should help us see whether the "download acknowledgment" contains corrupted data.

--Dan