
Re: [condor-users] One sub-job repeating over and over...

Heinz, Michael William wrote:

Except... One of the tasks is completing, then failing to send the files
back to the central manager and, next, the central manager starts the job

Let me make sure I understand the situation. You are submitting 10 jobs and 9 of them are finishing successfully. The remaining one is running to completion but is then failing to send back output files, which causes it to remain in the queue to be rescheduled.

If that is right, then first verify that there is nothing different about the job that is failing. Presumably all the successful jobs are producing output files and successfully sending them back? If file stage-backs are failing in general, then this is likely to be a different problem from the case where stage-backs are working for all but a fraction of your jobs.

If there is nothing special about the job, then check to see if the failure always happens when the job runs on a specific machine. You can see which machine is executing a job by looking in the user log file for the job (whatever was specified by log = X in the submit file).
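If the log is long, a short script can pull out the execution hosts for you. The following is only a rough sketch: it assumes each execute event in the user log is a single line of the form "001 (cluster.proc.subproc) <timestamp> Job executing on host: <address>", which may differ between Condor versions, and the function name hosts_by_job is made up for illustration.

```python
import re
from collections import defaultdict

# Assumed shape of an execute event line in the user log (may vary by version):
#   001 (cluster.proc.subproc) <timestamp> Job executing on host: <address>
EXECUTE_EVENT = re.compile(
    r"^001 \((\d+)\.(\d+)\.\d+\) .*?Job executing on host: (.+?)\s*$"
)


def hosts_by_job(log_text):
    """Map each 'cluster.proc' job id to the hosts it started on, in log order."""
    hosts = defaultdict(list)
    for line in log_text.splitlines():
        match = EXECUTE_EVENT.match(line)
        if match:
            cluster, proc, host = match.groups()
            # Strip the zero padding so ids read like "12.3" rather than "12.003".
            hosts[f"{int(cluster)}.{int(proc)}"].append(host)
    return dict(hosts)
```

Read the file named by log = X in the submit file and pass its contents to hosts_by_job. A job that keeps being rescheduled will show several entries; if they all name the same host, that machine is the first thing to look at.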

Dan Bradley
University of Wisconsin, Condor Project
