
[Condor-users] condor_write failures



Hello,

I use Condor with the Vanilla universe on Windows (the execute nodes
run version 7.2.4; the collector, schedd, submit, etc. run 7.0.1).
Recently, I have been seeing jobs restart randomly. While
troubleshooting the problem, I looked at the logs on the execute nodes
that were running the jobs and noticed messages like this in the
'StarterLog':

3/4 07:25:19 Job 12112.0 set to execute immediately
3/4 07:25:19 Starting a VANILLA universe job with ID: 12112.0
3/4 07:25:19 IWD: C:\condor\execute\dir_2808
3/4 07:25:19 Input file: C:\condor\execute\dir_2808\suite.lst
3/4 07:25:19 Output file: C:\condor\execute\dir_2808\result.txt
3/4 07:25:19 Error file: C:\condor\execute\dir_2808\result.txt
3/4 07:25:19 Renice expr "10" evaluated to 10
3/4 07:25:19 About to exec c:\windows\system32\cmd.exe /c perl 1267661582.pl
3/4 07:25:19 Create_Process succeeded, pid=3972
3/4 10:40:49 condor_read(): recv() returned -1, errno = 10054,
assuming failure reading 5 bytes from <10.127.140.10:3799>.
3/4 10:40:49 IO: Failed to read packet header
3/4 10:45:27 condor_write(): Socket closed when trying to write 198
bytes to <10.127.140.10:3799>, fd is 1788, errno=10054
3/4 10:45:27 Buf::write(): condor_write() failed
3/4 10:50:27 condor_write(): Socket closed when trying to write 198
bytes to <10.127.140.10:3799>, fd is 1788, errno=10054
3/4 10:50:27 Buf::write(): condor_write() failed
3/4 10:55:27 condor_write(): Socket closed when trying to write 199
bytes to <10.127.140.10:3799>, fd is 1788, errno=10054
3/4 10:55:27 Buf::write(): condor_write() failed
3/4 11:00:27 condor_write(): Socket closed when trying to write 198
bytes to <10.127.140.10:3799>, fd is 1788, errno=10054
3/4 11:00:27 Buf::write(): condor_write() failed
3/4 11:05:27 condor_write(): Socket closed when trying to write 198
bytes to <10.127.140.10:3799>, fd is 1788, errno=10054
3/4 11:05:27 Buf::write(): condor_write() failed
3/4 11:10:27 condor_write(): Socket closed when trying to write 198
bytes to <10.127.140.10:3799>, fd is 1788, errno=10054
3/4 11:10:27 Buf::write(): condor_write() failed
3/4 11:15:27 condor_write(): Socket closed when trying to write 199
bytes to <10.127.140.10:3799>, fd is 1788, errno=10054
3/4 11:15:27 Buf::write(): condor_write() failed
3/4 11:20:27 condor_write(): Socket closed when trying to write 199
bytes to <10.127.140.10:3799>, fd is 1788, errno=10054
3/4 11:20:27 Buf::write(): condor_write() failed
3/4 11:25:27 condor_write(): Socket closed when trying to write 199
bytes to <10.127.140.10:3799>, fd is 1788, errno=10054
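
To check how regular these failures are, a minimal sketch along the following lines could pull the condor_write() timestamps out of a StarterLog excerpt and print the gaps between them. The helper names (write_failure_times, gaps_in_seconds) are hypothetical, the sample is a shortened copy of the excerpt above, and the year 2010 is an assumption, since the StarterLog lines carry only month and day.

```python
import re
from datetime import datetime

# Shortened copy of the StarterLog excerpt, including the wrapped lines.
SAMPLE = """\
3/4 10:40:49 condor_read(): recv() returned -1, errno = 10054,
assuming failure reading 5 bytes from <10.127.140.10:3799>.
3/4 10:40:49 IO: Failed to read packet header
3/4 10:45:27 condor_write(): Socket closed when trying to write 198
bytes to <10.127.140.10:3799>, fd is 1788, errno=10054
3/4 10:45:27 Buf::write(): condor_write() failed
3/4 10:50:27 condor_write(): Socket closed when trying to write 198
bytes to <10.127.140.10:3799>, fd is 1788, errno=10054
3/4 10:50:27 Buf::write(): condor_write() failed
3/4 10:55:27 condor_write(): Socket closed when trying to write 199
bytes to <10.127.140.10:3799>, fd is 1788, errno=10054
"""

# Match only lines that start with a timestamp and report a closed socket.
STAMP = re.compile(
    r"^(\d{1,2})/(\d{1,2}) (\d{2}):(\d{2}):(\d{2}) "
    r"condor_write\(\): Socket closed"
)


def write_failure_times(text, year=2010):
    """Return a datetime for each condor_write() failure line (year is assumed)."""
    times = []
    for line in text.splitlines():
        match = STAMP.match(line)
        if match:
            month, day, hour, minute, second = map(int, match.groups())
            times.append(datetime(year, month, day, hour, minute, second))
    return times


def gaps_in_seconds(times):
    """Return the number of seconds between consecutive timestamps."""
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


print(gaps_in_seconds(write_failure_times(SAMPLE)))
```

On this excerpt every gap comes out at 300 seconds, so the starter appears to be retrying on a fixed five-minute schedule rather than failing at random moments.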


On the submit side (collector, schedd, etc.), I am seeing the
following in the ShadowLog:

3/4 12:36:46 (12112.0) (1748): condor_read(): recv() returned -1,
errno = 10054, assuming failure reading 5 bytes from
<10.127.248.239:1038>.
3/4 12:36:46 (12112.0) (1748): IO: Failed to read packet header
3/4 12:36:46 (12112.0) (1748): Can no longer talk to condor_starter
<10.127.248.239:1038>
3/4 12:36:46 (12112.0) (1748): Trying to reconnect to disconnected job
3/4 12:36:46 (12112.0) (1748): LastJobLeaseRenewal: 1267717005 Thu Mar
04 10:36:45 2010
3/4 12:36:46 (12112.0) (1748): JobLeaseDuration: 1200 seconds
3/4 12:36:46 (12112.0) (1748): JobLeaseDuration remaining: EXPIRED!
3/4 12:36:46 (12112.0) (1748): Reconnect FAILED: Job disconnected too
long: JobLeaseDuration (1200 seconds) expired
3/4 12:36:46 (12112.0) (1748): **** condor_shadow (condor_SHADOW)
EXITING WITH STATUS 107
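
As a sanity check on the lease arithmetic in these ShadowLog lines, here is a small sketch that recomputes how long the job had been disconnected when the reconnect was attempted. The lease_status helper is hypothetical, and the UTC-5 offset is an assumption inferred from the log rendering epoch 1267717005 as 10:36:45; this is not how the shadow itself computes the lease.

```python
from datetime import datetime, timedelta, timezone

LAST_JOB_LEASE_RENEWAL = 1267717005  # epoch seconds, from the ShadowLog
JOB_LEASE_DURATION = 1200            # seconds, from the ShadowLog

# Assumption: the submit machine's local time is UTC-5, because the log
# prints the epoch value above as "Thu Mar 04 10:36:45 2010".
LOCAL = timezone(timedelta(hours=-5))


def lease_status(now, last_renewal=LAST_JOB_LEASE_RENEWAL,
                 duration=JOB_LEASE_DURATION):
    """Return (seconds since last renewal, whether the lease has expired)."""
    elapsed = int(now.timestamp()) - last_renewal
    return elapsed, elapsed > duration


# Last renewal, rendered in the assumed local time zone.
renewed = datetime.fromtimestamp(LAST_JOB_LEASE_RENEWAL, LOCAL)

# Timestamp of the "Trying to reconnect" line in the ShadowLog.
reconnect_attempt = datetime(2010, 3, 4, 12, 36, 46, tzinfo=LOCAL)

elapsed, expired = lease_status(reconnect_attempt)
print(renewed.strftime("%a %b %d %H:%M:%S %Y"))
print(elapsed, expired)
```

Under that assumption the elapsed time is 7201 seconds, about two hours, against a 1200 second lease, which is consistent with the "EXPIRED!" line.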



It looks to me like the execute node is unable to communicate with
the shadow process, and once the JobLeaseDuration has passed, the
shadow exits and the job is vacated and restarted. The puzzle is that
the network appears to be healthy, and even if there was a
connectivity loss, I don't see it persisting over the 30+ minutes that
the execute node is unable to connect. I see that errno 10054 is
WSAECONNRESET. Is there any reason why the shadow process would be
resetting the connection? Are there any more detailed logs on the
shadow side that I can look at? Any other ideas or outstanding issues
that might be related?

Thanks in advance for any help you can provide.

-Adam