
[Condor-users] Job ExitCode different on Windows 7 vs Windows XP



Hi,

I am trying to add fault tolerance to my Condor pool.  I am attempting to retry jobs up to 5 times if they return a non-zero ExitCode, using requirements in the submission file:
== 0 || (ExitCode != 0 && JobRunCount >= 5)
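
A retry policy of this kind is typically written in the submit description file roughly as follows.  This is only a sketch: it assumes the expression is attached to the on_exit_remove command (the left-hand side of the line above is cut off in this message), and the executable and file names are placeholders, not the ones from my real submit file.

```
# Hypothetical submit description file; all names are placeholders.
universe       = vanilla
executable     = run_job.bat
output         = job.out
error          = job.err
log            = job.log
# Assumed command: let the job leave the queue when it exits with code 0,
# or once it has been run 5 times regardless of the exit code.
# Otherwise the job stays in the queue and is run again.
on_exit_remove = (ExitCode == 0) || (ExitCode != 0 && JobRunCount >= 5)
queue
```

With an expression like this, the retry only happens if the starter reports the real exit code of the job, which is why a status of 0 for a failed job defeats the policy.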

This works on my Windows 7 machines, but not on my XP machines.  On the XP machines, Condor always reports a return code of zero for the failing jobs.  I have attached snippets from two StarterLogs, one from a Win7 slot and one from an XP slot.  In each case I logged onto the machine where a job was running and triggered a failure in the same way.  I have verified, in my application logs and in the job's stdout log, that the .bat file referenced as the command in the submit file is returning a non-zero error code.  I think I am returning the error code from the .bat file in the "right" way.

I am using Condor 7.2.5.  Does anyone know whether this is a bug that has since been fixed?  It seems pretty critical, so perhaps there is some other explanation for the behavior I am seeing.  I need some help.  Is this the kind of grief I should expect when working with .bat files?

--
--Derrick
5/18 22:03:42 ******************************************************
5/18 22:03:42 Using config source: C:\Program Files\ERDAS\ERDAS Condor 2011\condor_config
5/18 22:03:42 Using local config sources: 
5/18 22:03:42    C:\PROGRA~1\ERDAS\ERDASC~1/condor_config.local
5/18 22:03:42 DaemonCore: Command Socket at <10.44.7.248:59091>
5/18 22:03:42 GLEXEC_JOB not supported on this platform; ignoring
5/18 22:03:42 Setting resource limits not implemented!
5/18 22:03:42 Communicating with shadow <10.44.7.143:55089>
5/18 22:03:42 Submitting machine is "jabberwocky.lggm.llc"
5/18 22:03:42 setting the orig job name in starter
5/18 22:03:42 setting the orig job iwd in starter
5/18 22:03:43 File transfer completed successfully.
5/18 22:03:44 Job 17128.0 set to execute immediately
5/18 22:03:44 Starting a VANILLA universe job with ID: 17128.0
5/18 22:03:44 IWD: C:\PROGRA~1\ERDAS\ERDASC~1\execute\dir_2076
5/18 22:03:44 Output file: C:\PROGRA~1\ERDAS\ERDASC~1\execute\dir_2076\97_20_resampleprocess_24003_img_8.out
5/18 22:03:44 Error file: C:\PROGRA~1\ERDAS\ERDASC~1\execute\dir_2076\97_20_resampleprocess_24003_img_8.err
5/18 22:03:44 Renice expr "10" evaluated to 10
5/18 22:03:44 About to exec C:\Windows\system32\cmd.exe /Q /C condor_exec.bat
5/18 22:03:44 Create_Process succeeded, pid=2640
5/18 22:07:19 Process exited, pid=2640, status=1
5/18 22:07:19 Got SIGQUIT.  Performing fast shutdown.
5/18 22:07:19 ShutdownFast all jobs.
5/18 22:07:19 **** condor_starter (condor_STARTER) pid 2076 EXITING WITH STATUS 
5/18 22:10:42 Process exited, pid=3268, status=0
5/18 22:10:42 Got SIGQUIT.  Performing fast shutdown.
5/18 22:10:42 ShutdownFast all jobs.
5/18 22:10:43 **** condor_starter (condor_STARTER) pid 2080 EXITING WITH STATUS 0
5/18 22:10:44 ******************************************************
5/18 22:10:44 ** condor_starter (CONDOR_STARTER) STARTING UP
5/18 22:10:44 ** C:\PROGRA~1\ERDAS\ERDASC~1\bin\condor_starter.exe
5/18 22:10:44 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
5/18 22:10:44 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
5/18 22:10:44 ** $CondorVersion: 7.2.5 Dec 16 2009 BuildID: 204104 $
5/18 22:10:45 ** $CondorPlatform: INTEL-WINNT50 $
5/18 22:10:45 ** PID = 2648
5/18 22:10:45 ** Log last touched 5/18 21:10:43
5/18 22:10:45 ******************************************************
5/18 22:10:45 Using config source: C:\Program Files\ERDAS\ERDAS Condor 2011\condor_config
5/18 22:10:45 Using local config sources: 
5/18 22:10:45    C:\PROGRA~1\ERDAS\ERDASC~1/condor_config.local
5/18 22:10:45 DaemonCore: Command Socket at <10.44.7.201:1991>
5/18 22:10:45 GLEXEC_JOB not supported on this platform; ignoring
5/18 22:10:45 Setting resource limits not implemented!
5/18 22:10:45 Communicating with shadow <10.44.7.143:55835>
5/18 22:10:45 Submitting machine is "jabberwocky.lggm.llc"
5/18 22:10:45 setting the orig job name in starter
5/18 22:10:45 setting the orig job iwd in starter
5/18 22:10:45 File transfer completed successfully.
5/18 22:10:46 Job 17142.0 set to execute immediately
5/18 22:10:46 Starting a VANILLA universe job with ID: 17142.0
5/18 22:10:46 IWD: C:\PROGRA~1\ERDAS\ERDASC~1\execute\dir_2648
5/18 22:10:46 Output file: C:\PROGRA~1\ERDAS\ERDASC~1\execute\dir_2648\97_20_resampleprocess_23004_img_22.out
5/18 22:10:46 Error file: C:\PROGRA~1\ERDAS\ERDASC~1\execute\dir_2648\97_20_resampleprocess_23004_img_22.err
5/18 22:10:46 Renice expr "10" evaluated to 10
5/18 22:10:46 About to exec C:\WINDOWS\system32\cmd.exe /Q /C condor_exec.bat
5/18 22:10:46 Create_Process succeeded, pid=3256
5/18 22:16:00 Process exited, pid=3256, status=0
5/18 22:16:01 Got SIGQUIT.  Performing fast shutdown.
5/18 22:16:01 ShutdownFast all jobs.
5/18 22:16:01 **** condor_starter (condor_STARTER) pid 2648 EXITING WITH STATUS 0