[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Condor-users] ProcAPI sanity failure, age = -98161996



Hi,

I am new to condor, trying a very simple experiment on a set of windows machines (note, condor works out of the box on my linux cluster - I am struggling with Windows 2003 Server).

I installed: condor-6.7.17-winnt50-x86.msi

I am trying to run this very simple batch file:


============
@echo off
set mypath=C:\WINDOWS\system32
set mypath2=C:\WINNT\system32

set executable=ipconfig.exe
set command=ipconfig
CALL :EXECUTE

rem *
rem * execute other system commands similarly here
rem *

GOTO END

:EXECUTE
  IF EXIST %mypath%\%executable% (
    %mypath%\%command% 
  ) ELSE ( 
    IF EXIST %mypath2%\%executable% (
      %mypath2%\%command%
    ) ELSE ( 
      echo ERROR: executable %executable% not found
    )  
  )
GOTO :EOF

:END
============

With the following submit:

============
universe = vanilla
requirements = ( Arch=="Intel") && ( OpSys=="WINNT52" )
executable = test2.bat
output = hello$(Process).out
errror = hello$(Process).err
transfer_files = ON_EXIT
log = hello.log
notification = Error
queue 20
============

If I look at my hello.log file, some (not all) are dieing with the error "Create_Process" failed:

============
05 (013.000.000) 03/13 10:57:05 Job terminated.
	(1) Normal termination (return value 0)
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
	291  -  Run Bytes Sent By Job
	473  -  Run Bytes Received By Job
	291  -  Total Bytes Sent By Job
	473  -  Total Bytes Received By Job
...
001 (013.001.000) 03/13 10:57:07 Job executing on host: <10.40.12.183:1737>
...
007 (013.001.000) 03/13 10:57:07 Shadow exception!
	Error from starter on vm2@xxxxxxxxxxxxxxxxxxx: Create_Process(C:\WINDOWS\system32\cmd.exe,condor_exec.exe /Q /C condor_exec.bat, ...) failed
	0  -  Run Bytes Sent By Job
	473  -  Run Bytes Received By Job
...
001 (013.002.000) 03/13 10:57:08 Job executing on host: <10.40.12.184:1866>
...
007 (013.002.000) 03/13 10:57:08 Shadow exception!
	Error from starter on vm1@xxxxxxxxxxxxxxxxxxx: Create_Process(C:\WINDOWS\system32\cmd.exe,condor_exec.exe /Q /C condor_exec.bat, ...) failed
	0  -  Run Bytes Sent By Job
	473  -  Run Bytes Received By Job
...
============

Here's the shadow log on the submit machine - I am not sure if that helps... 

============
3/13 11:02:33 ******************************************************
3/13 11:02:33 ** condor_shadow (CONDOR_SHADOW) STARTING UP
3/13 11:02:33 ** C:\condor\bin\condor_shadow.exe
3/13 11:02:33 ** $CondorVersion: 6.7.17 Feb 18 2006 $
3/13 11:02:33 ** $CondorPlatform: INTEL-WINNT50 $
3/13 11:02:33 ** PID = 2764
3/13 11:02:33 ******************************************************
3/13 11:02:33 Using config file: C:\condor\condor_config
3/13 11:02:33 Using local config files: C:\condor/condor_config.local
3/13 11:02:33 DaemonCore: Command Socket at <10.40.12.183:1542>
3/13 11:02:33 SEC_DEFAULT_SESSION_DURATION is undefined, using default value of 3600
3/13 11:02:33 Reading job ClassAd from STDIN
3/13 11:02:33 Initializing a VANILLA shadow for job 13.3
3/13 11:02:33 (13.3) (2764): init_user_ids: want user 'magala@CARYNT', current is '(null)@(null)'
3/13 11:02:33 (13.3) (2764): Found credential for user 'magala'
3/13 11:02:33 (13.3) (2764): LogonUser completed.
3/13 11:02:33 (13.3) (2764): TokenCache contents: 
magala@CARYNT
3/13 11:02:33 (13.3) (2764): ENABLE_USERLOG_LOCKING is undefined, using default value of True
3/13 11:02:33 (13.3) (2764): UserLog = C:\condor\test\hello.log
3/13 11:02:33 (13.3) (2764): *** Reserved Swap = 5120
3/13 11:02:33 (13.3) (2764): *** Free Swap = 4194303
3/13 11:02:33 (13.3) (2764): init_user_ids: want user 'magala@CARYNT', current is 'magala@CARYNT'
3/13 11:02:33 (13.3) (2764): init_user_ids: Already have handle for magala@CARYNT, so returning.
3/13 11:02:33 (13.3) (2764): TokenCache contents: 
magala@CARYNT
3/13 11:02:33 (13.3) (2764): in RemoteResource::initStartdInfo()
3/13 11:02:33 (13.3) (2764): SHADOW_TIMEOUT_MULTIPLIER is undefined, using default value of 0
3/13 11:02:33 (13.3) (2764): Entering DCStartd::activateClaim()
3/13 11:02:33 (13.3) (2764): DCStartd::activateClaim: successfully sent command, reply is: 1
3/13 11:02:33 (13.3) (2764): Request to run on <10.40.12.184:1866> was ACCEPTED
3/13 11:02:33 (13.3) (2764): Resource vm1@xxxxxxxxxxxxxxxxxxx changing state from PRE to STARTUP
3/13 11:02:33 (13.3) (2764): Getting monitoring info for pid 2764
3/13 11:02:33 (13.3) (2764): ProcAPI sanity failure, cpuusage = -0.000000
3/13 11:02:33 (13.3) (2764): ProcAPI sanity failure, age = -98162337
3/13 11:02:33 (13.3) (2764): entering FileTransfer::Init
3/13 11:02:33 (13.3) (2764): entering FileTransfer::SimpleInit
3/13 11:02:33 (13.3) (2764): Entering FileTransfer::InitDownloadFilenameRemaps
3/13 11:02:33 (13.3) (2764): entering FileTransfer::HandleCommands
3/13 11:02:33 (13.3) (2764): FileTransfer::HandleCommands read transkey=1#4415979929766e50
3/13 11:02:33 (13.3) (2764): entering FileTransfer::Upload
3/13 11:02:33 (13.3) (2764): entering FileTransfer::DoUpload
3/13 11:02:33 (13.3) (2764): DoUpload: send file C:\condor/spool\cluster13.ickpt.subproc0
3/13 11:02:33 (13.3) (2764): ReliSock::put_file_with_permissions(): going to send permissions 0
3/13 11:02:33 (13.3) (2764): put_file: going to send from filename C:\condor/spool\cluster13.ickpt.subproc0
3/13 11:02:33 (13.3) (2764): put_file: Found file size 473
3/13 11:02:33 (13.3) (2764): put_file: senting 473 bytes
3/13 11:02:33 (13.3) (2764): ReliSock: put_file: sent 473 bytes
3/13 11:02:33 (13.3) (2764): DoUpload: exiting at 1723
3/13 11:02:34 (13.3) (2764): DaemonCore: in SendAliveToParent()
3/13 11:02:34 (13.3) (2764): DaemonCore: attempting to connect to '<10.40.12.183:1736>'
3/13 11:02:34 (13.3) (2764): SHADOW_TIMEOUT_MULTIPLIER is undefined, using default value of 0
3/13 11:02:34 (13.3) (2764): SEC_TCP_SESSION_TIMEOUT is undefined, using default value of 20
3/13 11:02:34 (13.3) (2764): In handleJobRemoval(), sig 10
3/13 11:02:34 (13.3) (2764): setting exit reason on vm1@xxxxxxxxxxxxxxxxxxx to 102
3/13 11:02:34 (13.3) (2764): Resource vm1@xxxxxxxxxxxxxxxxxxx changing state from STARTUP to FINISHED
3/13 11:02:34 (13.3) (2764): Entering DCStartd::deactivateClaim(forceful)
3/13 11:02:34 (13.3) (2764): SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
3/13 11:02:34 (13.3) (2764): DCStartd::deactivateClaim: successfully sent command
3/13 11:02:34 (13.3) (2764): Killed starter (fast) at <10.40.12.184:1866>
3/13 11:02:54 (13.3) (2764): Resource vm1@xxxxxxxxxxxxxxxxxxx changing state from FINISHED to EXECUTING
3/13 11:02:54 (13.3) (2764): SHADOW_QUEUE_UPDATE_INTERVAL is undefined, using default value of 900
3/13 11:02:54 (13.3) (2764): QmgrJobUpdater: started timer to update queue (tid=6)
3/13 11:02:54 (13.3) (2764): ERROR "Error from starter on vm1@xxxxxxxxxxxxxxxxxxx: Create_Process(C:\WINDOWS\system32\cmd.exe,condor_exec.exe /Q /C condor_exec.bat, ...) failed" at line 597 in file ..\src\condor_shadow.V6.1\pseudo_ops.C
============

In the MasterLog, I also keep seeing the following: "ProcAPI sanity failure, age = xxxx". This error seems serious.


============
3/13 11:12:59 Time stamp of running C:\condor/bin/condor_master.exe: 1140256862
3/13 11:12:59 GetTimeStamp returned: 1140256862
3/13 11:12:59 Time stamp of running C:\condor/bin/condor_collector.exe: 1140256858
3/13 11:12:59 GetTimeStamp returned: 1140256858
3/13 11:12:59 Time stamp of running C:\condor/bin/condor_negotiator.exe: 1140256862
3/13 11:12:59 GetTimeStamp returned: 1140256862
3/13 11:12:59 Time stamp of running C:\condor/bin/condor_schedd.exe: 1140256866
3/13 11:12:59 GetTimeStamp returned: 1140256866
3/13 11:12:59 Time stamp of running C:\condor/bin/condor_startd.exe: 1140256866
3/13 11:12:59 GetTimeStamp returned: 1140256866
3/13 11:12:59 exit Daemons::CheckForNewExecutable
3/13 11:13:13 ProcAPI sanity failure, age = -98161996
3/13 11:13:13 ProcAPI sanity failure, age = -98161996
3/13 11:13:13 ProcAPI sanity failure, age = -98161996
3/13 11:13:13 ProcAPI sanity failure, age = -98161996
3/13 11:13:14 ProcAPI sanity failure, age = -98161995
3/13 11:13:14 ProcAPI sanity failure, age = -98161995
============

Any ideas? 

Thanks,
Matthew