
[Condor-users] Job is queued & runs, but fails with status -1073741502



Hello all,

We've been working to set up a small Windows-based pool using the
condor_credd daemon so that jobs run under the submitting user's ID.
After lots of fun & games, we finally got this to work.  However, we're
still having a number of problems...

The setup currently consists of 3 machines:  1 is the central manager, 1
is a submit machine, and 1 (my laptop) is a submit/execute machine.

Today, after running several jobs successfully (actually, the same job
multiple times), I ran the same job again and it failed to run.  Here
are the details from the ShadowLog for the 2 jobs (first one worked,
second failed).  Note the exit status for the second job of -1073741502:

	9/11 11:00:29 ******************************************************
	9/11 11:00:29 ** condor_shadow (CONDOR_SHADOW) STARTING UP
	9/11 11:00:29 ** C:\PROGRA~1\Condor\7-2-4\bin\condor_shadow.exe
	9/11 11:00:29 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
	9/11 11:00:29 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
	9/11 11:00:29 ** $CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $
	9/11 11:00:29 ** $CondorPlatform: INTEL-WINNT50 $
	9/11 11:00:29 ** PID = 4996
	9/11 11:00:29 ** Log last touched 9/11 09:36:06
	9/11 11:00:29 ******************************************************
	9/11 11:00:29 Using config source: C:\Program Files\Condor\7-2-4\condor_config
	9/11 11:00:29 Using local config sources:
	9/11 11:00:29    C:\PROGRA~1\Condor\7-2-4/condor_config.local
	9/11 11:00:29    C:\PROGRA~1\Condor\7-2-4/condor_config.local.credd
	9/11 11:00:29 DaemonCore: Command Socket at <xxx.xxx.xxx.xxx:xxxx>
	9/11 11:00:29 Initializing a VANILLA shadow for job 54.0
	9/11 11:00:37 (54.0) (4996): Request to run on xxxxxxxxxxxxxxxxxxxx <xxx.xxx.xxx.xxx:xxxx> was ACCEPTED
	9/11 11:10:20 (54.0) (4996): Job 54.0 terminated: exited with status 0
	9/11 11:10:27 (54.0) (4996): **** condor_shadow (condor_SHADOW) pid 4996 EXITING WITH STATUS 100
	9/11 11:18:53 ******************************************************
	9/11 11:18:53 ** condor_shadow (CONDOR_SHADOW) STARTING UP
	9/11 11:18:53 ** C:\PROGRA~1\Condor\7-2-4\bin\condor_shadow.exe
	9/11 11:18:53 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
	9/11 11:18:53 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
	9/11 11:18:53 ** $CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $
	9/11 11:18:53 ** $CondorPlatform: INTEL-WINNT50 $
	9/11 11:18:53 ** PID = 4176
	9/11 11:18:53 ** Log last touched 9/11 10:10:27
	9/11 11:18:53 ******************************************************
	9/11 11:18:53 Using config source: C:\Program Files\Condor\7-2-4\condor_config
	9/11 11:18:53 Using local config sources:
	9/11 11:18:53    C:\PROGRA~1\Condor\7-2-4/condor_config.local
	9/11 11:18:53    C:\PROGRA~1\Condor\7-2-4/condor_config.local.credd
	9/11 11:18:54 DaemonCore: Command Socket at <xxx.xxx.xxx.xxx:xxxx>
	9/11 11:18:54 Initializing a VANILLA shadow for job 55.0
	9/11 11:19:02 (55.0) (4176): Request to run on xxxxxxxxxxxxxxxxxxxx <xxx.xxx.xxx.xxx:xxxx> was ACCEPTED
	9/11 11:19:25 (55.0) (4176): Job 55.0 terminated: exited with status -1073741502
	9/11 11:19:28 (55.0) (4176): **** condor_shadow (condor_SHADOW) pid 4176 EXITING WITH STATUS 100
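
For reference, the failing status looks like a Windows NTSTATUS code
printed as a signed 32-bit integer.  Below is a minimal Python sketch of
the conversion, assuming that interpretation is right.  The helper names
(to_ntstatus, describe, KNOWN_CODES) are made up for illustration, and
the lookup table is deliberately tiny; it only lists a few values from
Microsoft's published NTSTATUS list.

```python
def to_ntstatus(exit_status: int) -> int:
    """Reinterpret a signed 32-bit exit status as an unsigned 32-bit value."""
    return exit_status & 0xFFFFFFFF


# Illustrative, incomplete table of NTSTATUS values (assumption: the job's
# exit status really is an NTSTATUS code reported by Windows).
KNOWN_CODES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC0000135: "STATUS_DLL_NOT_FOUND",
    0xC0000142: "STATUS_DLL_INIT_FAILED",
}


def describe(exit_status: int) -> str:
    """Return the hex form of the status plus its name, if the table has one."""
    code = to_ntstatus(exit_status)
    name = KNOWN_CODES.get(code, "unknown")
    return f"0x{code:08X} ({name})"


if __name__ == "__main__":
    print(describe(-1073741502))
```

Under that assumption, -1073741502 corresponds to 0xC0000142, which is
much easier to search for than the decimal value.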

Note that I was getting this same error yesterday, but after restarting
the execute machine today, I was able to run 3 jobs successfully before
it started failing again.  I've found a number of references to this
exit code in the mailing list archives, but nothing very definitive
about what might be the cause...especially in this case, where the same
job worked 20 minutes prior.  Note also that I've examined all of the
other log files on both the execute & central manager machines & don't
see anything different at the times these 2 jobs were processed (e.g.,
the entries for the failed job look the same as those for the successful
job).
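
Comparing runs by eye gets tedious, so here is a minimal Python sketch
for pulling the per-job exit statuses out of a ShadowLog.  It assumes
the "Job N.M terminated: exited with status S" wording shown in the
excerpt above; the function name exit_statuses is hypothetical, and the
pattern tolerates the line wrapping that mail clients add.

```python
import re

# Matches the termination line from the ShadowLog excerpt.  \s+ also matches
# newlines, so entries that were wrapped by a mail client are still found.
TERMINATED = re.compile(
    r"Job\s+(\d+\.\d+)\s+terminated:\s+exited\s+with\s+status\s+(-?\d+)"
)


def exit_statuses(log_text: str) -> list:
    """Return (job_id, exit_status) pairs in the order they appear."""
    return [(job, int(status)) for job, status in TERMINATED.findall(log_text)]


if __name__ == "__main__":
    sample = (
        "9/11 11:10:20 (54.0) (4996): Job 54.0 terminated: exited with\n"
        "status 0\n"
        "9/11 11:19:25 (55.0) (4176): Job 55.0 terminated: exited with\n"
        "status -1073741502\n"
    )
    for job, status in exit_statuses(sample):
        print(job, status)
```

Any job with a nonzero status can then be matched against the StarterLog
entries on the execute machine for the same timestamp.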

My submit script executes a .bat script, which runs fine interactively
on the execute machine.

Any help would be very much appreciated!

Thanks,
Dave Patch