
Re: [condor-users] One sub-job repeating over and over...



How big are the output files?  Are any of them over 2GB?  If so, that is the
problem.  Condor's file transfer mechanism can't handle files over 2GB.
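
If it helps, here is a rough sketch of one way to check for oversized files
in the execute directory before they are staged back. It is only an
illustration: the helper name find_oversized_files is made up for this
example, and treating "2GB" as 2**31 bytes is an assumption on my part, not
a documented Condor constant.

```python
import os

# Assumption: "2GB" is taken to mean 2**31 bytes (the signed 32-bit limit).
TWO_GB = 2**31


def find_oversized_files(directory, limit=TWO_GB):
    """Return sorted (path, size) pairs for files under `directory`
    whose size in bytes is at or above `limit`.

    Hypothetical helper for illustration; not part of Condor.
    """
    oversized = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            size = os.path.getsize(path)
            if size >= limit:
                oversized.append((path, size))
    return sorted(oversized)
```

Calling find_oversized_files() on the job's execute directory should return
an empty list if no output file is at or above the limit.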

On Tue, Dec 09, 2003 at 02:05:49PM -0500, Heinz, Michael William wrote:
> Yes, Dan
> 
> You understand correctly. An additional note should be that all 10 jobs are
> the same job with a different starting argument. In other words, job 1 is
> told to process frames 1-10, job 2 does frames 11-20 and so on. All the jobs
> finish except one, which differs (as far as I can tell) only in that
> it is processing frames 50-60. The problem occurs on any machine this
> particular job runs on. Manually inspecting the execute directory confirms
> that the job really is finishing all the output files, but then failing when
> it tries to upload them back to the manager.
> 
> I've worked around the problem (for now) by dividing into 20 jobs instead of
> 10. 
> 
> -----Original Message-----
> From: owner-condor-users@xxxxxxxxxxx [mailto:owner-condor-users@xxxxxxxxxxx]
> On Behalf Of Dan Bradley
> Sent: Tuesday, December 09, 2003 1:44 PM
> To: condor-users@xxxxxxxxxxx
> Subject: Re: [condor-users] One sub-job repeating over and over...
> 
> 
> 
> 
> Heinz, Michael William wrote:
> 
> >Except... One of the tasks is completing, then failing to send the 
> >files back to the central manager and, next, the central manager starts 
> >the job over!
> >
> 
> Let me make sure I understand the situation.  You are submitting 10 jobs 
> and 9 of them are finishing successfully.  The remaining one is running 
> to completion but is then failing to send back output files, which 
> causes it to remain in the queue to be rescheduled.
> 
> If that is right, then first verify that there is nothing different 
> about the job that is failing.  Presumably all the successful jobs are 
> producing output files and successfully sending them back?  If file 
> stage-backs are failing in general, then this is likely to be a 
> different problem from the case where stage-backs are working for all 
> but a fraction of your jobs.
> 
> If there is nothing special about the job, then check to see if the 
> failure always happens when the job runs on a specific machine.  You can 
> see which machine is executing a job by looking in the user log file for 
> the job (whatever was specified by log = X in the submit file).
> 
> Dan Bradley
> University of Wisconsin, Condor Project
> 
> 
> Condor Support Information: http://www.cs.wisc.edu/condor/condor-support/
> To Unsubscribe, send mail to majordomo@xxxxxxxxxxx with unsubscribe
> condor-users <your_email_address>
> 
> 
> 
> 