
RE: [condor-users] One sub-job repeating over and over...



About 200k per frame, 10 frames.

-----Original Message-----
From: owner-condor-users@xxxxxxxxxxx [mailto:owner-condor-users@xxxxxxxxxxx]
On Behalf Of Sean Murphy
Sent: Tuesday, December 09, 2003 2:11 PM
To: condor-users@xxxxxxxxxxx
Subject: Re: [condor-users] One sub-job repeating over and over...


How big are the output files?  Are they over 2GB?  If so that is the
problem. Condor's file transfer mechanism can't handle files over 2GB.
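One quick way to rule this out is to scan the job's output directory for anything above the limit. The following is only a sketch: the function name `oversized_files` is made up for illustration, and treating "2GB" as 2**31 bytes is an assumption, not something stated in this thread.

```python
import os

# Assumption: "2GB" is interpreted as 2**31 bytes.
LIMIT_BYTES = 2**31


def oversized_files(directory, limit=LIMIT_BYTES):
    """Return (name, size) pairs for regular files in `directory` larger than `limit` bytes."""
    hits = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            size = os.path.getsize(path)
            if size > limit:
                hits.append((name, size))
    return hits


if __name__ == "__main__":
    import sys

    # Hypothetical usage: pass the directory holding the job's output files.
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    for name, size in oversized_files(target):
        print(f"{name}: {size} bytes")
```

An empty result means no single file is over the limit, which points the investigation elsewhere.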

On Tue, Dec 09, 2003 at 02:05:49PM -0500, Heinz, Michael William wrote:
> Yes, Dan
> 
> You understand correctly. An additional note should be that all 10 
> jobs are the same job with a different starting argument. In 
> other words, job 1 is told to process frames 1-10, job 2 does frames 
> 11-20 and so on. All the jobs finish but the one, which is only 
> different (as far as I can tell) in that it is processing frames 
> 50-60. The problem occurs on any machine this particular job runs on. 
> Manually inspecting the execute directory confirms that the job really 
> is finishing all the output files, but then failing when it tries to 
> upload them back to the manager.
> 
> I've worked around the problem (for now) by dividing into 20 jobs 
> instead of 10.
> 
> -----Original Message-----
> From: owner-condor-users@xxxxxxxxxxx 
> [mailto:owner-condor-users@xxxxxxxxxxx]
> On Behalf Of Dan Bradley
> Sent: Tuesday, December 09, 2003 1:44 PM
> To: condor-users@xxxxxxxxxxx
> Subject: Re: [condor-users] One sub-job repeating over and over...
> 
> 
> 
> 
> Heinz, Michael William wrote:
> 
> >Except... One of the tasks is completing, then failing to send the
> >files back to the central manager and, next, the central manager starts 
> >the job over!
> >
> 
> Let me make sure I understand the situation.  You are submitting 10 jobs
> and 9 of them are finishing successfully.  The remaining one is running 
> to completion but is then failing to send back output files, which 
> causes it to remain in the queue to be rescheduled.
> 
> If that is right, then first verify that there is nothing different
> about the job that is failing.  Presumably all the successful jobs are 
> producing output files and successfully sending them back?  If file 
> stage-backs are failing in general, then this is likely to be a 
> different problem from the case where stage-backs are working for all 
> but a fraction of your jobs.
> 
> If there is nothing special about the job, then check to see if the
> failure always happens when the job runs on a specific machine.  You can 
> see which machine is executing a job by looking in the user log file for 
> the job (whatever was specified by log = X in the submit file).
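As a rough illustration of that check, the user log can be scanned for "executing" events and grouped by job. This sketch assumes the log contains lines of the form `001 (cluster.proc.subproc) date time Job executing on host: <ip:port>`; verify that against your own log before relying on it. The function name `hosts_by_job` is hypothetical.

```python
import re
from collections import defaultdict

# Assumed event line format (check against your own user log):
#   001 (012.005.000) 12/09 13:01:00 Job executing on host: <10.0.0.2:2222>
EXECUTE_RE = re.compile(
    r"^001 \((\d+)\.(\d+)\.\d+\) .*?Job executing on host: (\S+)"
)


def hosts_by_job(log_text):
    """Map 'cluster.proc' to the list of hosts the job started on, in log order."""
    hosts = defaultdict(list)
    for line in log_text.splitlines():
        match = EXECUTE_RE.match(line)
        if match:
            cluster, proc, host = match.groups()
            # int() strips the zero padding used in the log's job identifier.
            hosts[f"{int(cluster)}.{int(proc)}"].append(host)
    return dict(hosts)
```

A job that keeps restarting will show several entries. If they all name the same host, the machine is suspect; if they differ, the job itself is the more likely cause.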
> 
> Dan Bradley
> University of Wisconsin, Condor Project
> 
> 
> Condor Support Information: 
> http://www.cs.wisc.edu/condor/condor-support/
> To Unsubscribe, send mail to majordomo@xxxxxxxxxxx with unsubscribe
> condor-users <your_email_address>
> 
> 
> 
> 



