
Re: [Condor-users] Job continually being run due to shadow exception errors.



On 2/15/06, Greg.Hitchen@xxxxxxxx <Greg.Hitchen@xxxxxxxx> wrote:
>
> Hi All
>
> Below are some log files from a job submission that appears to run OK and
> produce the correct program output BUT condor considers it to have failed
> and keeps it in the queue and keeps resubmitting it and re-executing it.
>
> The job is a Monte Carlo simulation that can be limited to run for
> X amount of time. I have set it to run for 10 minutes of CPU time.
>
> The strange thing is that the condor job log file is there, even though the
> log files below indicate that the file transfer fails, and therefore causes
> the starter to exit, which in turn causes the shadow exception error,
> which is why condor keeps trying to run it all the time.

What is being written to D7EG9AB.condorlog and how?
Is it explicitly listed in the transfer files list, or are you relying
on the automatic transfer of all files that have changed?
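
For reference, the two cases might look roughly like this in a submit
description file. This is only a sketch: I have not seen the actual submit
file, and results.out is a made-up name used purely for illustration.

```
# Hypothetical submit description fragment; file names are illustrative.
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT

# Case 1: leave transfer_output_files unset, so every file the job created
# or modified in its scratch directory is sent back automatically.

# Case 2: list the output files explicitly. Every file named here must
# exist in the scratch directory when the job exits.
# transfer_output_files = results.out
```

If a file is named explicitly but the job never writes it on the execute
machine, the upload step would have nothing to open.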

Error code 2 indicates that the starter cannot find the specified file...
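
As a quick way to confirm what errno 2 means, here is a small Python sketch
(not Condor code) that opens a file which does not exist and reports the
resulting error number. The helper open_errno and the file name
missing.condorlog are both made up for the illustration.

```python
import errno
import os
import tempfile


def open_errno(path):
    """Return the errno raised when opening path for reading, or None on success."""
    try:
        with open(path, "rb"):
            return None
    except OSError as exc:
        return exc.errno


# Use a fresh, empty scratch directory so the file is guaranteed to be absent.
with tempfile.TemporaryDirectory() as scratch:
    missing = os.path.join(scratch, "missing.condorlog")  # hypothetical name
    code = open_errno(missing)

# errno.errorcode maps the number back to its symbolic name.
print(code, errno.errorcode[code])
```

This prints "2 ENOENT", i.e. "no such file or directory", which matches the
put_file failure in the StarterLog.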

Matt

<snip>

> STARTERLOG from executing machine (3 hour difference due to different
> time zone)

> 2/14 18:49:06 Output file: C:\Condor/execute\dir_4784\D7EG9AB.log
> 2/14 18:49:06 Renice expr "10" evaluated to 10
> 2/14 18:49:06 About to exec C:\Condor\execute\dir_4784\condor_exec.exe D7EG9AB.egs
> 2/14 18:49:06 Create_Process succeeded, pid=4664
> 2/14 18:58:51 Process exited, pid=4664, status=0
> 2/14 18:58:52 ReliSock: put_file: Failed to open file C:\Condor/execute\dir_4784\D7EG9AB.condorlog, errno = 2.
> 2/14 18:58:52 ERROR "DoUpload: Failed to send file C:\Condor/execute\dir_4784\D7EG9AB.condorlog, exiting at 1398
> " at line 1397 in file ..\src\condor_c++_util\file_transfer.C