RE: [condor-users] One sub-job repeating over and over...
- Date: Tue, 9 Dec 2003 14:05:49 -0500
- From: "Heinz, Michael William" <michael_heinz@xxxxxxxxx>
- Subject: RE: [condor-users] One sub-job repeating over and over...
You understand correctly. One additional note: all 10 jobs are the same job
with a different starting argument. In other words, job 1 is told to process
frames 1-10, job 2 does frames 11-20, and so on. All the jobs finish except
one, which differs (as far as I can tell) only in that it is processing
frames 50-60. The problem occurs on any machine this
particular job runs on. Manually inspecting the execute directory confirms
that the job really is finishing all the output files, but then failing when
it tries to upload them back to the manager.
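For reference, the frame split described above can be sketched as follows. This is only an illustration: the function name frame_ranges and its parameters are hypothetical, and it assumes every job gets an equal, non-overlapping block of frames starting at frame 1.

```python
def frame_ranges(num_jobs, frames_per_job, first_frame=1):
    """Return one inclusive (start, end) frame range per job.

    Hypothetical helper: assumes equal-sized, non-overlapping blocks.
    """
    ranges = []
    for job in range(num_jobs):
        start = first_frame + job * frames_per_job
        end = start + frames_per_job - 1
        ranges.append((start, end))
    return ranges


# 10 jobs of 10 frames each, as in the setup described above.
ranges = frame_ranges(num_jobs=10, frames_per_job=10)
for job_number, (start, end) in enumerate(ranges, start=1):
    print(f"job {job_number}: frames {start}-{end}")
```

Under this assumed scheme the sixth job would cover frames 51-60, so it may be worth checking whether the failing job's range really is 50-60 (11 frames) and therefore differs from the others.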
I've worked around the problem (for now) by dividing into 20 jobs instead of
From: owner-condor-users@xxxxxxxxxxx [mailto:owner-condor-users@xxxxxxxxxxx]
On Behalf Of Dan Bradley
Sent: Tuesday, December 09, 2003 1:44 PM
Subject: Re: [condor-users] One sub-job repeating over and over...
Heinz, Michael William wrote:
>Except... One of the tasks is completing, then failing to send the
>files back to the central manager and, next, the central manager starts
>the job over!
Let me make sure I understand the situation. You are submitting 10 jobs
and 9 of them are finishing successfully. The remaining one is running
to completion but is then failing to send back output files, which
causes it to remain in the queue to be rescheduled.
If that is right, then first verify that there is nothing different
about the job that is failing. Presumably all the successful jobs are
producing output files and successfully sending them back? If file
stage-backs are failing in general, then this is likely to be a
different problem from the case where stage-backs are working for all
but a fraction of your jobs.
If there is nothing special about the job, then check to see if the
failure always happens when the job runs on a specific machine. You can
see which machine is executing a job by looking in the user log file for
the job (whatever was specified by log = X in the submit file).
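As a rough sketch of that check, the script below collects the execute hosts recorded for each job. It assumes the plain-text user log format in which an execute event is a line beginning with event code 001 and containing "Job executing on host:"; the function name hosts_per_job and the sample lines (including the addresses) are made up for illustration, so adjust the pattern if your log looks different.

```python
import re
from collections import defaultdict

# Assumed shape of an execute event line in the user log, for example:
# 001 (008.004.000) 12/09 13:44:12 Job executing on host: <192.0.2.17:32779>
EXECUTE_EVENT = re.compile(
    r"^001 \((\d+)\.(\d+)\.\d+\) .*Job executing on host: (\S+)"
)


def hosts_per_job(log_lines):
    """Map 'cluster.proc' to the hosts the job started on, in log order."""
    hosts = defaultdict(list)
    for line in log_lines:
        match = EXECUTE_EVENT.match(line)
        if match:
            cluster, proc, host = match.groups()
            hosts[f"{int(cluster)}.{int(proc)}"].append(host)
    return dict(hosts)


# Made-up sample lines; in practice, pass an open file object for the
# file named by "log = X" in the submit file.
sample_log = [
    "000 (008.004.000) 12/09 13:40:01 Job submitted from host: <192.0.2.1:32770>",
    "001 (008.004.000) 12/09 13:44:12 Job executing on host: <192.0.2.17:32779>",
    "001 (008.004.000) 12/09 13:58:40 Job executing on host: <192.0.2.23:32781>",
    "001 (008.005.000) 12/09 13:44:15 Job executing on host: <192.0.2.18:32779>",
]

result = hosts_per_job(sample_log)
for job_id, hosts in result.items():
    print(job_id, hosts)
```

A job that keeps getting rescheduled shows up with several hosts in its list. If the list always contains the same machine, that machine is the suspect; if the hosts vary, the job itself is more likely the cause.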
University of Wisconsin, Condor Project
Condor Support Information: http://www.cs.wisc.edu/condor/condor-support/
To Unsubscribe, send mail to majordomo@xxxxxxxxxxx with
unsubscribe condor-users <your_email_address>