
Re: [Condor-users] Job continually being run due to shadowexception errors.



Hi Matt and Jaime

Thanks for your replies.

The D7EG9AB.condorlog file is simply the standard condor user log file.
Shouldn't condor be creating this, not the job?

The submit file is listed below.

executable = egs.exe
environment = XPERT_DIR=\\arthur-lu\montecarlo
output     = D7EG9AB.log
log        = D7EG9AB.condorlog
arguments  = D7EG9AB.egs
universe   = vanilla
transfer_input_files = egs.exe,D7EG9AB.egs,auto_design7.pegsdat
transfer_output_files = D7EG9AB.log,D7EG9AB.condorlog
queue
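
If transferring D7EG9AB.condorlog back is indeed the problem, as Matt and
Jaime suggest, the minimal change would be to drop it from
transfer_output_files. This is an untested sketch under that assumption;
every other line is unchanged from the submit file above:

```text
# Untested sketch: same submit file with the user log removed from
# transfer_output_files. The user log (log = ...) is written on the
# submit side, so the job never creates it on the execute machine.
executable = egs.exe
environment = XPERT_DIR=\\arthur-lu\montecarlo
output     = D7EG9AB.log
log        = D7EG9AB.condorlog
arguments  = D7EG9AB.egs
universe   = vanilla
transfer_input_files = egs.exe,D7EG9AB.egs,auto_design7.pegsdat
# Assumption: D7EG9AB.log is left here as in the original; whether the
# stdout file also needs to be listed is not settled in this thread.
transfer_output_files = D7EG9AB.log
queue
```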

I have enabled the ALL_DEBUG = D_FULLDEBUG option in the config file
but am waiting for the changes to percolate around the pool (our
install script creates a scheduled task that runs daily and checks
our main condor repository for updated binaries and/or config files).

I am hoping this will give more detailed info regarding this problem.

The D7EG9AB.condorlog file (that I previously referred to as the JOB
LOG) contained:

000 (002.000.000) 02/14 15:20:50 Job submitted from host: <130.116.140.99:9138>
...
001 (002.000.000) 02/14 15:48:27 Job executing on host: <130.155.66.195:9549>
...
007 (002.000.000) 02/14 15:58:09 Shadow exception!
	Can no longer talk to condor_starter on execute machine (130.155.66.195)
	0  -  Run Bytes Sent By Job
	2209052  -  Run Bytes Received By Job
...
001 (002.000.000) 02/14 15:58:31 Job executing on host: <130.155.66.195:9549>
...
007 (002.000.000) 02/14 16:08:22 Shadow exception!
	Can no longer talk to condor_starter on execute machine (130.155.66.195)
	0  -  Run Bytes Sent By Job
	2209052  -  Run Bytes Received By Job

Thanks.

Cheers

Greg

> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx 
> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Jaime Frey
> Sent: Thursday, 16 February 2006 8:10 AM
> To: Condor-Users Mail List
> Subject: Re: [Condor-users] Job continually being run due to 
> shadowexception errors.
> 
> 
> On Feb 14, 2006, at 10:06 PM, <Greg.Hitchen@xxxxxxxx>  
> <Greg.Hitchen@xxxxxxxx> wrote:
> 
> > Below are some log files from a job submission that appears to run OK
> > and produce the correct program output BUT condor considers it to have
> > failed and keeps it in the queue and keeps resubmitting it and
> > re-executing it.
> >
> > The job is a monte carlo simulation that can be limited to run for X
> > amount of time. I have set it to run for 10mins CPU time.
> >
> > The strange thing is that the condor job log file is there, even though
> > the log files below indicate that the file transfer fails, and therefore
> > causes the starter to exit, which in turn causes the shadow exception
> > error, which is why condor keeps trying to run it all the time.
> 
> The Condor user log (what you call the Condor Job Log) is written on  
> the submit side, so its contents are unaffected by the file transfer  
> problems (other than noting the failures).
> 
> For files that are transferred from the execute machine, Condor  
> creates empty copies when the job is submitted to verify that it can  
> write to them later (when the job completes).
> 
> As Matt noted, it looks like you specified D7EG9AB.condorlog to be  
> transferred, but your job isn't creating the file.
> 
> +--------------------------------+-----------------------------------+
> |           Jaime Frey           | I used to be a heavy gambler.     |
> |       jfrey@xxxxxxxxxxx        | But now I just make mental bets.  |
> | http://www.cs.wisc.edu/~jfrey/ | That's how I lost my mind.        |
> +--------------------------------+-----------------------------------+