[Condor-users] Job continually being run due to shadow exception errors.



Hi All

Below are some log files from a job submission that appears to run OK
and produce the correct program output, BUT Condor considers it to have
failed, keeps it in the queue, and keeps resubmitting and re-executing it.

The job is a Monte Carlo simulation that can be limited to run for
a set amount of time. I have set it to run for 10 minutes of CPU time.

The strange thing is that the Condor job log file is there, even though
the log files below indicate that the file transfer fails. That failure
causes the starter to exit, which in turn causes the shadow exception
error, which is why Condor keeps trying to run the job again.

Thanks for any help.

Cheers

Greg

CONDOR JOB LOG from submitting machine

000 (002.000.000) 02/14 15:20:50 Job submitted from host: <130.116.140.99:9138>
...
001 (002.000.000) 02/14 15:48:27 Job executing on host: <130.155.66.195:9549>
...
007 (002.000.000) 02/14 15:58:09 Shadow exception!
	Can no longer talk to condor_starter on execute machine (130.155.66.195)
	0  -  Run Bytes Sent By Job
	2209052  -  Run Bytes Received By Job
...
001 (002.000.000) 02/14 15:58:31 Job executing on host: <130.155.66.195:9549>
...
007 (002.000.000) 02/14 16:08:22 Shadow exception!
	Can no longer talk to condor_starter on execute machine (130.155.66.195)
	0  -  Run Bytes Sent By Job
	2209052  -  Run Bytes Received By Job

SCHEDLOG from submitting machine

2/14 15:47:40 Sent ad to central manager for hit023@xxxxxxxx
2/14 15:48:03 Activity on stashed negotiator socket
2/14 15:48:03 Negotiating for owner: hit023@xxxxxxxx
2/14 15:48:03 Checking consistency running and runnable jobs
2/14 15:48:03 Tables are consistent
2/14 15:48:03 Out of jobs - 1 jobs matched, 0 jobs idle, flock level = 1
2/14 15:48:03 Sent ad to central manager for hit023@xxxxxxxx
2/14 15:48:06 Started shadow for job 2.0 on "<130.155.66.195:9549>", (shadow pid = 8568)
2/14 15:48:08 Sent ad to central manager for hit023@xxxxxxxx
2/14 15:53:09 Sent ad to central manager for hit023@xxxxxxxx
2/14 15:58:09 DaemonCore: Command received via UDP from host <130.116.140.99:9558>
2/14 15:58:09 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
2/14 15:58:09 Shadow pid 8568 for job 2.0 exited with status 4
2/14 15:58:09 ERROR: Shadow exited with job exception code!

SHADOWLOG from submitting machine

2/14 15:48:06 ******************************************************
2/14 15:48:06 ** condor_shadow (CONDOR_SHADOW) STARTING UP
2/14 15:48:06 ** C:\Condor\bin\condor_shadow.exe
2/14 15:48:06 ** $CondorVersion: 6.6.10 Jun 22 2005 $
2/14 15:48:06 ** $CondorPlatform: INTEL-WINNT50 $
2/14 15:48:06 ** PID = 8568
2/14 15:48:06 ******************************************************
2/14 15:48:06 Using config file: c:\condor\condor_config
2/14 15:48:06 Using local config files: C:\Condor/condor_config.local
2/14 15:48:06 DaemonCore: Command Socket at <130.116.140.99:9403>
2/14 15:48:07 Initializing a VANILLA shadow
2/14 15:48:09 (2.0) (8568): Request to run on <130.155.66.195:9549> was ACCEPTED
2/14 15:58:09 (2.0) (8568): condor_read(): recv() returned -1, errno = 10054, assuming failure.
2/14 15:58:09 (2.0) (8568): condor_read(): recv() returned -1, errno = 10054, assuming failure.
2/14 15:58:09 (2.0) (8568): ERROR "Can no longer talk to condor_starter on execute machine (130.155.66.195)" at line 63 in file ..\src\condor_shadow.V6.1\NTreceivers.C

STARTERLOG from executing machine (3 hour difference due to different time zone)

2/14 18:48:47 ******************************************************
2/14 18:48:47 ** condor_starter (CONDOR_STARTER) STARTING UP
2/14 18:48:47 ** C:\Condor\bin\condor_starter.exe
2/14 18:48:47 ** $CondorVersion: 6.6.10 Jun 22 2005 $
2/14 18:48:47 ** $CondorPlatform: INTEL-WINNT50 $
2/14 18:48:47 ** PID = 4784
2/14 18:48:47 ******************************************************
2/14 18:48:47 Using config file: c:\condor\condor_config
2/14 18:48:47 Using local config files: C:\Condor/condor_config.local
2/14 18:48:47 DaemonCore: Command Socket at <130.155.66.195:9056>
2/14 18:48:47 Setting resource limits not implemented!
2/14 18:48:47 Starter communicating with condor_shadow <130.116.140.99:9403>
2/14 18:48:47 Submitting machine is "gregh-kf.arrc.csiro.au"
2/14 18:49:06 File transfer completed successfully.
2/14 18:49:06 Starting a VANILLA universe job with ID: 2.0
2/14 18:49:06 IWD: C:\Condor/execute\dir_4784
2/14 18:49:06 Output file: C:\Condor/execute\dir_4784\D7EG9AB.log
2/14 18:49:06 Renice expr "10" evaluated to 10
2/14 18:49:06 About to exec C:\Condor\execute\dir_4784\condor_exec.exe D7EG9AB.egs
2/14 18:49:06 Create_Process succeeded, pid=4664
2/14 18:58:51 Process exited, pid=4664, status=0
2/14 18:58:52 ReliSock: put_file: Failed to open file C:\Condor/execute\dir_4784\D7EG9AB.condorlog, errno = 2.
2/14 18:58:52 ERROR "DoUpload: Failed to send file C:\Condor/execute\dir_4784\D7EG9AB.condorlog, exiting at 1398
" at line 1397 in file ..\src\condor_c++_util\file_transfer.C
2/14 18:58:52 ShutdownFast all jobs.
2/14 18:58:52 Error disabling account condor-reuse-vm1 (ACCESS DENIED)
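
For anyone wanting to check their own StarterLog for the same symptom,
here is a minimal sketch, assuming the log is available as plain text,
that pulls out each file the starter could not open and the errno it
reported. The function name failed_transfers and the regular expression
are illustrative only and are not part of Condor. In the C runtime,
errno 2 is ENOENT ("No such file or directory").

```python
import re

# Matches the starter's "Failed to open file <path>, errno = N" message.
# The pattern is an assumption based on the single log line shown above.
FAILED_OPEN = re.compile(
    r"Failed to open file\s+(?P<path>\S+),\s*errno = (?P<errno>\d+)"
)


def failed_transfers(log_text):
    """Return (path, errno) pairs for each failed open in a StarterLog."""
    # Collapse all whitespace so that log lines wrapped by a mail client
    # are matched as a single message.
    flat = " ".join(log_text.split())
    return [
        (m.group("path"), int(m.group("errno")))
        for m in FAILED_OPEN.finditer(flat)
    ]


# Sample taken from the StarterLog above, including the mail-client wrap.
sample = r"""2/14 18:58:51 Process exited, pid=4664, status=0
2/14 18:58:52 ReliSock: put_file: Failed to open file
C:\Condor/execute\dir_4784\D7EG9AB.condorlog, errno = 2.
"""

for path, err in failed_transfers(sample):
    print(path, err)
```

Any file it reports with errno 2 is one the starter looked for in the
execute directory at that moment and did not find.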


-----------------------------------------------------------------------
Greg Hitchen
greg.hitchen@xxxxxxxx
CSIRO Exploration and Mining                    phone: +61 8 6436 8663
Australian Resources Research Centre (ARRC)     fax:   +61 8 6436 8555
Postal address:                                 mob:   0407 952 748
PO Box 1130, Bentley WA 6102, Australia
Street Address:
26 Dick Perry Avenue, Kensington WA 6151
-----------------------------------------------------------------------