[Condor-users] condor 7.2 on windows--dumb batch file fails that worked on 7.1.0



We have a 100-node Windows cluster running 7.1.0, except for one
machine (plowshare) that I've updated to 7.2. I'm having difficulty
getting the 7.2 machine to run jobs, so I assembled a stupidly simple
batch file and submitted just that. On a randomly chosen 7.1.0 machine
(43) there's no problem: the job runs, and the output file contains
what one would expect. On the 7.2 machine, however, the job terminates
with exit code 128, and nothing is written to the output or error
files.


mystupid.bat -- "executable"
-------------------
mkdir temp
echo "dir:"
dir
set TMP=%_CONDOR_SCRATCH_DIR%\temp
set TEMP=%_CONDOR_SCRATCH_DIR%\temp
echo "dir temp"
dir temp
whoami

mystupid.log.43 -- log file on a 7.1.0 machine
-----------------------
000 (108.000.000) 01/07 15:08:07 Job submitted from host: <34.52.12.4:38333>
...
001 (108.000.000) 01/07 15:08:10 Job executing on host: <34.52.8.225:1055>
...
005 (108.000.000) 01/07 15:08:15 Job terminated.
	(1) Normal termination (return value 0)
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
	866  -  Run Bytes Sent By Job
	138  -  Run Bytes Received By Job
	866  -  Total Bytes Sent By Job
	138  -  Total Bytes Received By Job

mystupid.out.43 -- output file on a 7.1.0 machine
-----------------------
"dir:"
 Volume in drive C has no label.
 Volume Serial Number is 80BB-A5FA

 Directory of C:\condor\execute\dir_1280

01/07/2009  03:08 PM    <DIR>          .
01/07/2009  03:08 PM    <DIR>          ..
01/07/2009  03:05 PM               138 condor_exec.bat
01/07/2009  03:08 PM                 0 mystupid.err
01/07/2009  03:08 PM                 0 mystupid.out
01/07/2009  03:08 PM    <DIR>          temp
               3 File(s)            138 bytes
               3 Dir(s)  37,877,338,112 bytes free
"dir temp"
 Volume in drive C has no label.
 Volume Serial Number is 80BB-A5FA

 Directory of C:\condor\execute\dir_1280\temp

01/07/2009  03:08 PM    <DIR>          .
01/07/2009  03:08 PM    <DIR>          ..
               0 File(s)              0 bytes
               2 Dir(s)  37,877,334,016 bytes free
enaus00053043\condor-reuse-slot1


mystupid.log.plowshare -- log file on the 7.2 machine
----------------------------------
000 (109.000.000) 01/07 15:10:14 Job submitted from host: <34.52.12.4:38333>
...
001 (109.000.000) 01/07 15:10:16 Job executing on host: <34.52.8.222:4465>
...
005 (109.000.000) 01/07 15:10:16 Job terminated.
	(1) Normal termination (return value 128)
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
	0  -  Run Bytes Sent By Job
	138  -  Run Bytes Received By Job
	0  -  Total Bytes Sent By Job
	138  -  Total Bytes Received By Job

mystupid.out.plowshare -- output file on the 7.2 machine (empty)
----------------------------------
0-byte file
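
For anyone comparing the two runs side by side, here is a minimal Python
sketch that pulls the reported return value out of job log files like the
two above. The function name return_values and both regular expressions
are assumptions based only on the log lines shown in this post; they are
not part of Condor, and other log events are simply ignored.

```python
import re
import sys

# Matches the termination line as it appears in the logs above, e.g.
#   (1) Normal termination (return value 128)
RETURN_VALUE = re.compile(r"Normal termination \(return value (\d+)\)")

# Matches an event header such as
#   005 (109.000.000) 01/07 15:10:16 Job terminated.
EVENT_HEADER = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.\d+\)")


def return_values(log_text):
    """Map "cluster.proc" job IDs to the return value found in the log text."""
    results = {}
    current_job = None
    for line in log_text.splitlines():
        header = EVENT_HEADER.match(line)
        if header:
            # Remember which job the following indented lines belong to.
            current_job = f"{int(header.group(2))}.{int(header.group(3))}"
            continue
        found = RETURN_VALUE.search(line)
        if found and current_job is not None:
            results[current_job] = int(found.group(1))
    return results


if __name__ == "__main__":
    # Usage: python script.py mystupid.log.43 mystupid.log.plowshare
    for path in sys.argv[1:]:
        with open(path) as handle:
            for job, value in return_values(handle.read()).items():
                print(f"{path}: job {job} returned {value}")
```

Run against the two log files above, this should report job 108.0 and
job 109.0 with their respective return values.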

Help, please?

Here are some possibly relevant log snippets from the 7.2 machine:

StarterLog.slot1
-----------------------
1/7 15:10:15 ******************************************************
1/7 15:10:15 ** condor_starter (CONDOR_STARTER) STARTING UP
1/7 15:10:15 ** C:\condor\bin\condor_starter.exe
1/7 15:10:15 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
1/7 15:10:15 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
1/7 15:10:15 ** $CondorVersion: 7.2.0 Dec 21 2008 BuildID: none $
1/7 15:10:15 ** $CondorPlatform: INTEL-WINNT50 $
1/7 15:10:15 ** PID = 3132
1/7 15:10:15 ** Log last touched 1/7 14:58:15
1/7 15:10:15 ******************************************************
1/7 15:10:15 Using config source: C:\condor\condor_config
1/7 15:10:15 Using local config sources:
1/7 15:10:15    C:\condor/condor_config.local
1/7 15:10:15 DaemonCore: Command Socket at <34.52.8.222:4585>
1/7 15:10:15 GLEXEC_JOB not supported on this platform; ignoring
1/7 15:10:15 Setting resource limits not implemented!
1/7 15:10:15 Communicating with shadow <34.52.12.4:41983>
1/7 15:10:15 Submitting machine is "feynman.corp.halliburton.com"
1/7 15:10:15 setting the orig job name in starter
1/7 15:10:15 setting the orig job iwd in starter
1/7 15:10:15 File transfer completed successfully.
1/7 15:10:16 Job 109.0 set to execute immediately
1/7 15:10:16 Starting a VANILLA universe job with ID: 109.0
1/7 15:10:16 Tracking process family by login "condor-reuse-slot1"
1/7 15:10:16 IWD: C:\condor\execute\dir_3132
1/7 15:10:16 Output file: C:\condor\execute\dir_3132\mystupid.out
1/7 15:10:16 Error file: C:\condor\execute\dir_3132\mystupid.err
1/7 15:10:16 Renice expr "10" evaluated to 10
1/7 15:10:16 About to exec C:\WINNT\system32\cmd.exe /Q /C condor_exec.bat
1/7 15:10:16 Create_Process succeeded, pid=2116
1/7 15:10:16 Process exited, pid=2116, status=128
1/7 15:10:16 Got SIGQUIT.  Performing fast shutdown.
1/7 15:10:16 ShutdownFast all jobs.
1/7 15:10:16 **** condor_starter (condor_STARTER) pid 3132 EXITING WITH STATUS 0


StartLog
------------
1/7 15:10:14 slot1: match_info called
1/7 15:10:14 slot1: Received match <34.52.8.222:4465>#1231361361#8#...
1/7 15:10:14 slot1: State change: match notification protocol successful
1/7 15:10:14 slot1: Changing state: Unclaimed -> Matched
1/7 15:10:14 slot1: Request accepted.
1/7 15:10:14 slot1: Remote owner is grant@xxxxxxxxxxxxxxxxxxxx
1/7 15:10:14 slot1: State change: claiming protocol successful
1/7 15:10:14 slot1: Changing state: Matched -> Claimed
1/7 15:10:14 slot1: Got activate_claim request from shadow (<34.52.12.4:36569>)
1/7 15:10:14 slot1: Remote job ID is 109.0
1/7 15:10:15 slot1: Got universe "VANILLA" (5) from request classad
1/7 15:10:15 slot1: State change: claim-activation protocol successful
1/7 15:10:15 slot1: Changing activity: Idle -> Busy
1/7 15:10:16 slot1: Called deactivate_claim_forcibly()
1/7 15:10:16 condor_write(): Socket closed when trying to write 56 bytes to <34.52.12.4:40339>, fd is 228
1/7 15:10:16 Buf::write(): condor_write() failed
1/7 15:10:16 slot1: State change: received RELEASE_CLAIM command
1/7 15:10:16 slot1: Changing state and activity: Claimed/Busy -> Preempting/Vacating
1/7 15:10:16 Starter pid 3132 exited with status 0
1/7 15:10:16 slot1: State change: starter exited
1/7 15:10:16 slot1: State change: No preempting claim, returning to owner
1/7 15:10:16 slot1: Changing state and activity: Preempting/Vacating -> Owner/Idle
1/7 15:10:16 slot1: State change: IS_OWNER is false
1/7 15:10:16 slot1: Changing state: Owner -> Unclaimed

The submission machine is a Linux box running 7.1.0.

Thanks,
Grant
-- 
Grant Goodyear		
web: http://www.grantgoodyear.org	
e-mail: grant@xxxxxxxxxxxxxxxxx