
[condor-users] One sub-job repeating over and over...



Hey, all - I've got a submission that divides itself into 10 sub-tasks. Each
task renders a few frames with povray and sends them back to the manager. If
I run the tasks locally, they all complete successfully.

Except... one of the tasks completes, then fails to send its files back to
the central manager, and the central manager then starts the job over!

Here's the log from the node running the task:

12/9 11:47:36 DoUpload: send file shell_1_54.png
12/9 11:47:36 condor_write(): send() returned -1, timeout=0, errno=10053. Assuming failure.
12/9 11:47:36 Buf::write(): condor_write() failed
12/9 11:47:36 ReliSock: put_file: Failed to send filesize.
12/9 11:47:36 ERROR "DoUpload: Failed to send file C:\Condor/execute\dir_900\shell_1_54.png, exiting at 1371

If I understand correctly, 10053 is the Winsock error WSAECONNABORTED, meaning
an established connection was aborted by software on the local machine.
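
For anyone else decoding logs like the one above, here is a minimal Python
sketch that pulls the errno out of a log line and looks it up in a small
hand-written table of Winsock codes. The table, the regular expression, and
the function name describe_errno are illustrative assumptions, not part of
Condor, and the table is far from exhaustive.

```python
import re

# Hand-written table of a few common Winsock error codes (not exhaustive).
WINSOCK_ERRORS = {
    10053: "WSAECONNABORTED",  # software caused connection abort
    10054: "WSAECONNRESET",    # connection reset by peer
    10060: "WSAETIMEDOUT",     # connection timed out
    10061: "WSAECONNREFUSED",  # connection refused
}

# Matches the "errno=<digits>" field as it appears in the log excerpt.
ERRNO_PATTERN = re.compile(r"errno=(\d+)")


def describe_errno(log_line):
    """Return (code, name) for the first errno in log_line, or None."""
    match = ERRNO_PATTERN.search(log_line)
    if match is None:
        return None
    code = int(match.group(1))
    return code, WINSOCK_ERRORS.get(code, "unknown")


line = "12/9 11:47:36 condor_write(): send() returned -1, timeout=0, errno=10053."
print(describe_errno(line))
```

Lines without an errno field return None, and codes missing from the table
come back labeled "unknown" rather than raising an error.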

Meanwhile, the central manager is reporting that the shadow has lost contact
with condor_starter on the execute machine.

Any suggestions?

