
RE: [condor-users] One sub-job repeating over and over...



Yes, Dan

You understand correctly. One additional note: all 10 jobs are the same
job with a different starting argument. In other words, job 1 is told to
process frames 1-10, job 2 does frames 11-20, and so on. All the jobs finish
successfully except one, which differs (as far as I can tell) only in that
it is processing frames 50-60. The problem occurs on any machine this
particular job runs on. Manually inspecting the execute directory confirms
that the job really does write all of its output files, but then fails when
it tries to upload them back to the manager.

I've worked around the problem (for now) by dividing into 20 jobs instead of
10. 
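
For anyone wanting to try the same workaround, it amounts to cutting the
same frame range into more, smaller chunks. A minimal Python sketch follows.
The function name frame_ranges and the total of frames 1-100 are assumptions
for illustration only; the real frame count is not stated in this thread.

```python
def frame_ranges(first, last, n_jobs):
    """Split the inclusive range first..last into n_jobs contiguous chunks.

    Hypothetical helper: returns a list of (start, end) tuples, one per job.
    Any remainder is spread over the earliest chunks.
    """
    total = last - first + 1
    base, extra = divmod(total, n_jobs)
    ranges = []
    start = first
    for i in range(n_jobs):
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # more jobs than frames: skip empty chunks
        ranges.append((start, start + size - 1))
        start += size
    return ranges


if __name__ == "__main__":
    # Assumed total of 100 frames, split 20 ways instead of 10.
    for start, end in frame_ranges(1, 100, 20):
        print(start, end)
```

Each (start, end) pair would then become the argument of one job in the
submit file.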

-----Original Message-----
From: owner-condor-users@xxxxxxxxxxx [mailto:owner-condor-users@xxxxxxxxxxx]
On Behalf Of Dan Bradley
Sent: Tuesday, December 09, 2003 1:44 PM
To: condor-users@xxxxxxxxxxx
Subject: Re: [condor-users] One sub-job repeating over and over...




Heinz, Michael William wrote:

>Except... One of the tasks is completing, then failing to send the 
>files back to the central manager and, next, the central manager starts 
>the job over!
>

Let me make sure I understand the situation.  You are submitting 10 jobs 
and 9 of them are finishing successfully.  The remaining one is running 
to completion but is then failing to send back output files, which 
causes it to remain in the queue to be rescheduled.

If that is right, then first verify that there is nothing different 
about the job that is failing.  Presumably all the successful jobs are 
producing output files and successfully sending them back?  If file 
stage-backs are failing in general, then this is likely to be a 
different problem from the case where stage-backs are working for all 
but a fraction of your jobs.

If there is nothing special about the job, then check to see if the 
failure always happens when the job runs on a specific machine.  You can 
see which machine is executing a job by looking in the user log file for 
the job (whatever was specified by log = X in the submit file).
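
As a rough illustration of that check, the Python sketch below collects the
execute hosts recorded for each job in a user log. It assumes the usual
text-format log, where each execution start is an event line beginning with
"001 (cluster.proc.subproc)" and containing "Job executing on host:". The
function name execute_hosts and the sample log text are made up for this
example; adjust the pattern if your log looks different.

```python
import re

# Assumed shape of an execute event line in the user log, e.g.
# 001 (012.005.000) 12/09 13:10:02 Job executing on host: <192.0.2.10:32771>
EXECUTE_RE = re.compile(
    r"^001 \((\d+)\.(\d+)\.\d+\) .*Job executing on host: (.+)$"
)


def execute_hosts(log_text):
    """Map 'cluster.proc' to the list of hosts the job started on, in order."""
    hosts = {}
    for line in log_text.splitlines():
        match = EXECUTE_RE.match(line.strip())
        if match:
            job_id = f"{int(match.group(1))}.{int(match.group(2))}"
            hosts.setdefault(job_id, []).append(match.group(3).strip())
    return hosts


if __name__ == "__main__":
    # Made-up sample; in practice read the file named by "log = X".
    sample = (
        "001 (012.005.000) 12/09 13:10:02 Job executing on host: <192.0.2.10:32771>\n"
        "...\n"
        "001 (012.005.000) 12/09 14:02:11 Job executing on host: <192.0.2.11:32771>\n"
    )
    for job, seen in execute_hosts(sample).items():
        print(job, seen)
```

If the failing job lists several different hosts, the machine is probably
not the cause.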

Dan Bradley
University of Wisconsin, Condor Project


Condor Support Information: http://www.cs.wisc.edu/condor/condor-support/
To Unsubscribe, send mail to majordomo@xxxxxxxxxxx with unsubscribe
condor-users <your_email_address>



