[Condor-users] job run is failing ... not sure why



The other day I posted a question about something that was baffling
me: why machines were complaining that an executable didn't exist. It
turned out that the jobs were being submitted to my own machine. I've
figured that part out, but the jobs are still failing.

I am trying to run a simple hello-world batch file with this command file:

universe = vanilla
executable = c:\test.bat
initialdir = \\eac.ad.ea.com\sports\artworks\renderfarm\BakingPool\10c87638-a7e0-4805-802f-543691508a23
output = \\eac.ad.ea.com\sports\artworks\renderfarm\BakingPool\10c87638-a7e0-4805-802f-543691508a23\First-LM_First_2.output
error = \\eac.ad.ea.com\sports\artworks\renderfarm\BakingPool\10c87638-a7e0-4805-802f-543691508a23\First-LM_First_2.error
log = \\eac.ad.ea.com\sports\artworks\renderfarm\BakingPool\10c87638-a7e0-4805-802f-543691508a23\First-LM_First_2.log
should_transfer_files = no
arguments =
#requirements = OpSys =?= "WINNT52"
rank = (isFarmMachine =?= True) * SlotID

Queue

With the requirements line commented out, the job gets scheduled on my
own client machine. I've confirmed this in the scheduler log, and
tailing the StarterLog on my machine shows this:

8/26 09:30:55 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
8/26 10:20:01 ******************************************************
8/26 10:20:01 ** condor_starter (CONDOR_STARTER) STARTING UP
8/26 10:20:01 ** C:\Condor\bin\condor_starter.exe
8/26 10:20:01 ** $CondorVersion: 7.0.0 Jan 22 2008 BuildID: 72173 $
8/26 10:20:01 ** $CondorPlatform: INTEL-WINNT50 $
8/26 10:20:01 ** PID = 2984
8/26 10:20:01 ** Log last touched 8/26 08:30:55
8/26 10:20:01 ******************************************************
8/26 10:20:01 Using config source: C:\Condor\condor_config
8/26 10:20:01 Using local config sources:
8/26 10:20:01    C:\Condor/condor_config.local
8/26 10:20:01 DaemonCore: Command Socket at <10.10.41.74:3078>
8/26 10:20:01 Setting resource limits not implemented!
8/26 10:20:01 Communicating with shadow <10.10.41.74:3070>
8/26 10:20:01 Submitting machine is "D1019079.eac.ad.ea.com"
8/26 10:20:01 setting the orig job name in starter
8/26 10:20:01 setting the orig job iwd in starter
8/26 10:20:01 Job 110.0 set to execute immediately
8/26 10:20:01 Starting a VANILLA universe job with ID: 110.0
8/26 10:20:01 IWD: \\eac.ad.ea.com\sports\artworks\renderfarm\BakingPool\10c87638-a7e0-4805-802f-543691508a23
8/26 10:20:01 Output file: \\eac.ad.ea.com\sports\artworks\renderfarm\BakingPool\10c87638-a7e0-4805-802f-543691508a23\First-LM_First_2.output
8/26 10:20:01 Error file: \\eac.ad.ea.com\sports\artworks\renderfarm\BakingPool\10c87638-a7e0-4805-802f-543691508a23\First-LM_First_2.error
8/26 10:20:01 Renice expr "10" evaluated to 10
8/26 10:20:01 About to exec C:\WINDOWS\system32\cmd.exe /Q /C condor_exec.bat
8/26 10:20:01 Create_Process succeeded, pid=5904
8/26 10:20:02 Process exited, pid=5904, status=1
8/26 10:20:22 condor_read(): timeout reading 5 bytes from <10.10.41.74:1092>.
8/26 10:20:22 IO: Failed to read packet header
8/26 10:20:22 condor_write(): Socket closed when trying to write 302 bytes to <10.10.41.74:3087>, fd is 1328
8/26 10:20:22 Buf::write(): condor_write() failed
8/26 10:20:22 SECMAN: Error sending response classad!
MyType = "(unknown type)"
TargetType = "(unknown type)"
AuthMethods = "NTSSPI,KERBEROS"
CryptoMethods = "3DES,BLOWFISH"
OutgoingNegotiation = "PREFERRED"
Authentication = "OPTIONAL"
Encryption = "OPTIONAL"
Integrity = "OPTIONAL"
Enact = "NO"
Subsystem = "STARTD"
ParentUniqueID = "D1019079:164:1251225140"
ServerPid = 2304
SessionDuration = "8640000"
NewSession = "YES"
RemoteVersion = "$CondorVersion: 7.0.0 Jan 22 2008 BuildID: 72173 $"
ServerCommandSock = "<10.10.41.74:1092>"
Command = 60010
AuthCommand = 60000
8/26 10:20:22 ERROR: DC_AUTHENTICATE unable to receive auth_info!
8/26 10:20:22 Got SIGQUIT.  Performing fast shutdown.
8/26 10:20:22 ShutdownFast all jobs.
8/26 10:20:22 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
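
The one line in that dump that refers to the job itself seems to be
"Process exited, pid=5904, status=1", logged about 20 seconds before
the socket errors start. To pick such lines out of a long StarterLog,
here is a minimal sketch in Python. It assumes the line format shown
above; the function name find_job_exits and the sample list are only
illustrative and are not part of Condor.

```python
import re

# Matches starter log lines of the form seen above, e.g.
#   8/26 10:20:02 Process exited, pid=5904, status=1
# (the format is assumed from this one log, not from Condor documentation)
EXIT_RE = re.compile(
    r"^(?P<stamp>\d+/\d+ \d+:\d+:\d+) Process exited, "
    r"pid=(?P<pid>\d+), status=(?P<status>-?\d+)"
)


def find_job_exits(lines):
    """Return a (timestamp, pid, status) tuple for each 'Process exited' line."""
    exits = []
    for line in lines:
        match = EXIT_RE.match(line.strip())
        if match:
            exits.append(
                (match.group("stamp"), int(match.group("pid")), int(match.group("status")))
            )
    return exits


# A few lines copied from the StarterLog above, used as sample input.
sample = [
    "8/26 10:20:01 Create_Process succeeded, pid=5904",
    "8/26 10:20:02 Process exited, pid=5904, status=1",
    "8/26 10:20:22 IO: Failed to read packet header",
]

for stamp, pid, status in find_job_exits(sample):
    print(f"{stamp} pid={pid} exit status={status}")
```

To run it against the real file, pass the lines of C:\Condor\log\StarterLog
(or wherever the log lives on your install; that path is a guess) instead
of the sample list.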

I'm completely unsure what the problem is.  I haven't been able to
find condor_exec.bat anywhere, and I'm confused about why the starter
is running that instead of test.bat. Is it something the starter
creates in order to run jobs?  And what are these errors about? What is
it trying to read from and write to the socket? What authentication is
being done here?

Any help at all appreciated.

Mark.