Re: [Condor-users] condor 7.2 on windows--dumb batch file fails that worked on 7.1.0



 Burnett@xxxxxxxxxxx wrote:
>
> I've tried to reproduce your problem, but your batch file seems
> to work well for me under 7.2.0.  Could you possibly bump up the
> debug level for the logs, attach your configuration file and
> a listing of the submit file?  (I imagine the submit file is about
> as interesting as the batch file, so maybe it won't be much
> help... but who knows.)

mystupid.sub (submission file)
------------

Executable = mystupid.bat
Universe   = vanilla
Log        = mystupid.log
Output     = mystupid.out
Error      = mystupid.err
+AccountingGroup = "grant"
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
Requirements = ( OpSys == "WINNT52" && Machine == "plowshare.corp.halliburton.com" )

config file:
http://www.grantgoodyear.org/~grant/condor_config

mystupid.log
------------

000 (112.000.000) 01/09 12:06:56 Job submitted from host: <34.52.12.4:33244>
...
001 (112.000.000) 01/09 12:06:58 Job executing on host: <34.52.8.222:3423>
...
005 (112.000.000) 01/09 12:06:58 Job terminated.
	(1) Normal termination (return value 128)
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
	0  -  Run Bytes Sent By Job
	138  -  Run Bytes Received By Job
	0  -  Total Bytes Sent By Job
	138  -  Total Bytes Received By Job
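
For anyone who wants to pull the exit code out of a user log like the one above without reading it by eye, here is a minimal Python sketch. It assumes the event header has the form "NNN (cluster.proc.subproc) ..." and that event 005 is followed by a "Normal termination (return value N)" line, as in this log. The function name return_values and the SAMPLE string are made up for illustration and are not part of Condor.

```python
import re

# Detail line of a 005 (job terminated) event, as seen in mystupid.log.
RETURN_VALUE = re.compile(r"Normal termination \(return value (\d+)\)")
# Event header such as "005 (112.000.000) 01/09 12:06:58 Job terminated."
EVENT_HEADER = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\)")

def return_values(log_text):
    """Map (cluster, proc) to the return value of each normally terminated job."""
    results = {}
    current = None  # job id of the 005 event being read, if any
    for line in log_text.splitlines():
        header = EVENT_HEADER.match(line)
        if header:
            code, cluster, proc, _subproc = header.groups()
            current = (int(cluster), int(proc)) if code == "005" else None
            continue
        if current is not None:
            match = RETURN_VALUE.search(line)
            if match:
                results[current] = int(match.group(1))
    return results

# Hypothetical sample: a trimmed copy of the log lines shown above.
SAMPLE = """\
000 (112.000.000) 01/09 12:06:56 Job submitted from host: <34.52.12.4:33244>
...
005 (112.000.000) 01/09 12:06:58 Job terminated.
\t(1) Normal termination (return value 128)
"""

print(return_values(SAMPLE))  # {(112, 0): 128}
```

Run against the full mystupid.log text, this should report return value 128 for job 112.0, matching the "status=128" line in the StarterLog below.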

StarterLog.slot1
----------------
1/9 12:06:47 WARNING: Config source is empty: C:\condor/condor_config.local
1/9 12:06:47 ******************************************************
1/9 12:06:47 ** condor_starter (CONDOR_STARTER) STARTING UP
1/9 12:06:47 ** C:\condor\bin\condor_starter.exe
1/9 12:06:47 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
1/9 12:06:47 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
1/9 12:06:47 ** $CondorVersion: 7.2.0 Dec 21 2008 BuildID: none $
1/9 12:06:47 ** $CondorPlatform: INTEL-WINNT50 $
1/9 12:06:47 ** PID = 3652
1/9 12:06:47 ** Log last touched 1/7 15:49:59
1/9 12:06:47 ******************************************************
1/9 12:06:47 Using config source: C:\condor\condor_config
1/9 12:06:47 Using local config sources:
1/9 12:06:47    C:\condor/condor_config.local
1/9 12:06:47 DaemonCore: Command Socket at <34.52.8.222:3447>
1/9 12:06:47 Will use UDP to update collector crossroads.corp.halliburton.com <34.52.8.226:9618>
1/9 12:06:47 GLEXEC_JOB not supported on this platform; ignoring
1/9 12:06:47 Setting resource limits not implemented!
1/9 12:06:47 Communicating with shadow <34.52.12.4:56311>
1/9 12:06:47 Shadow version: $CondorVersion: 7.1.0 Apr  1 2008 BuildID: 80895 $
1/9 12:06:47 Submitting machine is "feynman.corp.halliburton.com"
1/9 12:06:47 Instantiating a StarterHookMgr
1/9 12:06:47 Job does not define HookKeyword, not invoking any job hooks.
1/9 12:06:47 setting the orig job name in starter
1/9 12:06:47 setting the orig job iwd in starter
1/9 12:06:47 ShouldTransferFiles is "YES", transfering files
1/9 12:06:47 init_user_ids: want user 'nobody@.', current is '(null)@(null)'
1/9 12:06:47 Using dynamic user account.
1/9 12:06:47 dynuser: Re-enabling account (condor-reuse-slot1)
1/9 12:06:47 dynuser::createuser(condor-reuse-slot1) successful
1/9 12:06:47 perm::init() starting up for account (condor-reuse-slot1) domain (NULL)
1/9 12:06:47 perm::init: Found Account Name condor-reuse-slot1
1/9 12:06:47 Done moving to directory "C:\condor\execute\dir_3652"
1/9 12:06:47 TokenCache contents:
condor-reuse-slot1@.
1/9 12:06:47 JICShadow::initIOProxy(): Job does not define WantIOProxy
1/9 12:06:47 No StarterUserLog found in job ClassAd
1/9 12:06:47 Starter will not write a local UserLog
1/9 12:06:47 Changing the executable name
1/9 12:06:47 entering FileTransfer::Init
1/9 12:06:47 entering FileTransfer::SimpleInit
1/9 12:06:47 TransferIntermediate="(none)"
1/9 12:06:47 entering FileTransfer::DownloadFiles
1/9 12:06:47 Initialized the following authorization table:
1/9 12:06:47 Authorizations yet to be resolved:
1/9 12:06:47 allow NEGOTIATOR:  */34.52.8.226 */crossroads
1/9 12:06:47 allow ADMINISTRATOR:  */34.52.12.4 */plowshare.corp.halliburton.com */feynman.corp.halliburton.com */34.52.8.222
1/9 12:06:47 allow OWNER:  */34.52.12.4 */plowshare.corp.halliburton.com */plowshare.corp.halliburton.com */feynman.corp.halliburton.com */34.52.8.222 */34.52.8.222
1/9 12:06:47 entering FileTransfer::Download
1/9 12:06:47 About to sock duplicate, old sock=270 new sock=FFFFFFFF state=0
1/9 12:06:47 Socket duplicated, old sock=270 new sock=248 state=0
1/9 12:06:47 entering FileTransfer::DownloadThread
1/9 12:06:47 entering FileTransfer::DoDownload sync=1
1/9 12:06:47 TokenCache contents:
condor-reuse-slot1@.
1/9 12:06:47 Sending GoAhead for 34.52.12.4 to send C:\condor\execute\dir_3652\condor_exec.exe and all further files.
1/9 12:06:47 Received GoAhead from peer to receive C:\condor\execute\dir_3652\condor_exec.exe.
1/9 12:06:47 get_file(): going to write to filename C:\condor\execute\dir_3652\condor_exec.exe
1/9 12:06:47 get_file: Receiving 138 bytes
1/9 12:06:48 get_file: wrote 138 bytes to file
1/9 12:06:48 ReliSock::get_file_with_permissions(): received null permissions from peer, not setting
1/9 12:06:48 DaemonCore: in SendAliveToParent()
1/9 12:06:48 DaemonCore: Leaving SendAliveToParent() - success
1/9 12:06:48 File transfer completed successfully.
1/9 12:06:49 Calling client FileTransfer handler function.
1/9 12:06:49 HOOK_PREPARE_JOB not configured.
1/9 12:06:49 Job 112.0 set to execute immediately
1/9 12:06:49 Starting a VANILLA universe job with ID: 112.0
1/9 12:06:49 In OsProc::OsProc()
1/9 12:06:49 Main job KillSignal: 15 (Unknown)
1/9 12:06:49 Main job RmKillSignal: 15 (Unknown)
1/9 12:06:49 Main job HoldKillSignal: 15 (Unknown)
1/9 12:06:49 in VanillaProc::StartJob()
1/9 12:06:49 Executable is .bat, so running C:\WINNT\system32\cmd.exe /Q /C condor_exec.bat
1/9 12:06:49 Tracking process family by login "condor-reuse-slot1"
1/9 12:06:49 in OsProc::StartJob()
1/9 12:06:49 IWD: C:\condor\execute\dir_3652
1/9 12:06:49 TokenCache contents:
condor-reuse-slot1@.
1/9 12:06:49 Input file: NUL
1/9 12:06:49 Output file: C:\condor\execute\dir_3652\mystupid.out
1/9 12:06:49 Error file: C:\condor\execute\dir_3652\mystupid.err
1/9 12:06:49 Renice expr "10" evaluated to 10
1/9 12:06:49 About to exec C:\WINNT\system32\cmd.exe /Q /C condor_exec.bat
1/9 12:06:49 Env = _CONDOR_SLOT=1 _CONDOR_SCRATCH_DIR=C:\condor\execute\dir_3652
1/9 12:06:49 In OwnerProfile::update()
1/9 12:06:49 GetBinaryType() returned 0
1/9 12:06:49 TokenCache contents:
condor-reuse-slot1@.
1/9 12:06:49 Create_Process succeeded, pid=2756
1/9 12:06:49 Process exited, pid=2756, status=128
1/9 12:06:49 in VanillaProc::JobReaper()
1/9 12:06:49 Inside OsProc::JobReaper()
1/9 12:06:49 TokenCache contents:
condor-reuse-slot1@.
1/9 12:06:49 TokenCache contents:
condor-reuse-slot1@.
1/9 12:06:49 Reaper: all=1 handled=1 ShuttingDown=0
1/9 12:06:49 In VanillaProc::PublishUpdateAd()
1/9 12:06:49 Inside OsProc::PublishUpdateAd()
1/9 12:06:49 HOOK_JOB_EXIT not configured.
1/9 12:06:49 TokenCache contents:
condor-reuse-slot1@.
1/9 12:06:49 entering FileTransfer::UploadFiles (final_transfer=1)
1/9 12:06:49 Skipping condor_exec.bat
1/9 12:06:49 Sending new file mystupid.err, time==1231524409, size==0
1/9 12:06:49 Sending new file mystupid.out, time==1231524409, size==0
1/9 12:06:49 FileTransfer::UploadFiles: sent TransKey=1#49679240eb210e275a047d8
1/9 12:06:49 entering FileTransfer::Upload
1/9 12:06:49 entering FileTransfer::DoUpload
1/9 12:06:49 DoUpload: send file mystupid.err
1/9 12:06:49 Received GoAhead from peer to send C:\condor\execute\dir_3652\mystupid.err.
1/9 12:06:49 Sending GoAhead for 34.52.12.4 to receive C:\condor\execute\dir_3652\mystupid.err and all further files.
1/9 12:06:49 ReliSock::put_file_with_permissions(): going to send permissions 0
1/9 12:06:49 put_file: going to send from filename C:\condor\execute\dir_3652\mystupid.err
1/9 12:06:49 put_file: Found file size 0
1/9 12:06:49 put_file: sending 0 bytes
1/9 12:06:49 ReliSock: put_file: sent 0 bytes
1/9 12:06:49 DoUpload: send file mystupid.out
1/9 12:06:49 Received GoAhead from peer to send C:\condor\execute\dir_3652\mystupid.out.
1/9 12:06:49 ReliSock::put_file_with_permissions(): going to send permissions 0
1/9 12:06:49 put_file: going to send from filename C:\condor\execute\dir_3652\mystupid.out
1/9 12:06:49 put_file: Found file size 0
1/9 12:06:49 put_file: sending 0 bytes
1/9 12:06:49 ReliSock: put_file: sent 0 bytes
1/9 12:06:49 DoUpload: exiting at 2397
1/9 12:06:49 Inside OsProc::JobExit()
1/9 12:06:49 In OwnerProfile::loaded()
1/9 12:06:49 TokenCache contents:
condor-reuse-slot1@.
1/9 12:06:49 In VanillaProc::PublishUpdateAd()
1/9 12:06:49 Inside OsProc::PublishUpdateAd()
1/9 12:06:49 Sent job ClassAd update to startd.
1/9 12:06:49 In OwnerProfile::loaded()
1/9 12:06:49 Got SIGQUIT.  Performing fast shutdown.
1/9 12:06:49 ShutdownFast all jobs.
1/9 12:06:49 Got ShutdownFast when no jobs running.
1/9 12:06:49 Removing C:\condor\execute\dir_3652
1/9 12:06:49 Attempting to remove C:\condor\execute\dir_3652 as SuperUser (system)
1/9 12:06:49 **** condor_starter (condor_STARTER) pid 3652 EXITING WITH STATUS 0
1/9 12:06:49 Deleting the StarterHookMgr

StartLog
--------

1/9 12:06:47 Adding to resolved authorization table: */34.52.12.4: DAEMON

1/9 12:06:47 slot1: Schedd addr = <34.52.12.4:33244>
1/9 12:06:47 slot1: Alive interval = 300
1/9 12:06:47 slot1: Received ClaimId from schedd (<34.52.8.222:3423>#1231524275#1#...)
1/9 12:06:47 slot1: Rank of this claim is: 0.000000
1/9 12:06:47 slot1: Request accepted.
1/9 12:06:47 slot1: Remote owner is grant@xxxxxxxxxxxxxxxxxxxx
1/9 12:06:47 slot1: State change: claiming protocol successful
1/9 12:06:47 slot1: Changing state: Unclaimed -> Claimed
1/9 12:06:47 slot1: Started ClaimLease timer (18) w/ 1800 second lease duration
1/9 12:06:47 Adding to resolved authorization table: */34.52.8.226: NEGOTIATOR

1/9 12:06:47 slot1: match_info called
1/9 12:06:47 slot1: Got activate_claim request from shadow (<34.52.12.4:42290>)
1/9 12:06:47 slot1: Read request ad and starter from shadow.
1/9 12:06:47 Swap space: 4194303
1/9 12:06:47 slot2: Total execute space: 36827132
1/9 12:06:47 slot3: Total execute space: 36827132
1/9 12:06:47 slot4: Total execute space: 36827132
1/9 12:06:47 slot1: Total execute space: 36827132
1/9 12:06:47 slot1: Remote job ID is 112.0
1/9 12:06:47 slot1: Remote global job ID is feynman.corp.halliburton.com#1231524415#112.0
1/9 12:06:47 slot1: JobLeaseDuration defined in job ClassAd: 1200
1/9 12:06:47 slot1: Resetting ClaimLease timer (18) with new duration
1/9 12:06:47 slot1: About to Create_Process "condor_starter -f -a slot1 feynman.corp.halliburton.com"
1/9 12:06:47 GetBinaryType() returned 0
1/9 12:06:47 GetBinaryType() returned 0
1/9 12:06:47 slot1: Got RemoteUser (grant@xxxxxxxxxxxxxxxxxxxx) from request classad
1/9 12:06:47 slot1: Got universe "VANILLA" (5) from request classad
1/9 12:06:47 slot1: State change: claim-activation protocol successful
1/9 12:06:47 slot1: Changing activity: Idle -> Busy
1/9 12:06:47 Started polling timer.
1/9 12:06:48 Adding to resolved authorization table: */34.52.8.222: DAEMON

1/9 12:06:48 Received UDP command 60008 (DC_CHILDALIVE) from <34.52.8.222:3449>, access level DAEMON
1/9 12:06:49 slot1: Received job ClassAd update from starter.
1/9 12:06:49 slot1: Closing job ClassAd update socket from starter.
1/9 12:06:49 slot1: Computing claimWorklifeExpired(); ClaimAge=2, ClaimWorklife=600
1/9 12:06:49 slot1: Called deactivate_claim_forcibly()
1/9 12:06:49 slot1: In Starter::kill() with pid 3652, sig 3 (SIGQUIT)
1/9 12:06:49 condor_write(): Socket closed when trying to write 56 bytes to <34.52.12.4:39340>, fd is 232
1/9 12:06:49 Buf::write(): condor_write() failed
1/9 12:06:49 Failed to send response ClassAd in deactivate_claim.
1/9 12:06:49 slot1: State change: received RELEASE_CLAIM command
1/9 12:06:49 slot1: Canceled ClaimLease timer (18)
1/9 12:06:49 slot1: Changing state and activity: Claimed/Busy -> Preempting/Vacating
1/9 12:06:49 slot1: In Starter::kill() with pid 3652, sig 15 (SIGTERM)
1/9 12:06:49 Adding to resolved authorization table: */34.52.8.222: READ

1/9 12:06:49 Starter pid 3652 exited with status 0
1/9 12:06:49 slot1: Canceled hardkill-starter timer (25)
1/9 12:06:49 slot1: State change: starter exited
1/9 12:06:49 slot1: State change: No preempting claim, returning to owner
1/9 12:06:49 slot1: Changing state and activity: Preempting/Vacating -> Owner/Idle
1/9 12:06:49 slot1: State change: IS_OWNER is false
1/9 12:06:49 slot1: Changing state: Owner -> Unclaimed
1/9 12:06:51 Trying to update collector <34.52.8.226:9618>
1/9 12:06:51 Attempting to send update via UDP to collector crossroads.corp.halliburton.com <34.52.8.226:9618>
1/9 12:06:51 slot1: Sent update to 1 collector(s)
1/9 12:06:52 Canceled polling timer (21)

Thanks!

Please let me know what else you might need.  I'm going to be away
until Wednesday, but I'll be happy to provide whatever you need once
I return.

Thanks again,
Grant
-- 
Grant Goodyear		
web: http://www.grantgoodyear.org	
e-mail: grant@xxxxxxxxxxxxxxxxx