
Re: [HTCondor-users] jobs not executing



These are some of the parameters in the master condor_config:

CONDOR_HOST=$(FULL_HOSTNAME)
ALLOW_ADMINISTRATOR=$(IP_ADDRESS)
ALLOW_READ=*
ALLOW_WRITE=*
START=FALSE

Clients:
CONDOR_HOST=master ip
ALLOW_ADMINISTRATOR=master ip
ALLOW_READ=*
ALLOW_WRITE=*
START=TRUE

Anything else that would be helpful to know about this configuration?

Quoting brad.32
I'm trying to move the HTCondor master from one Windows 7 computer to another while still using the computers that were in the original pool, so I uninstalled HTCondor on both of these Win7 machines and reinstalled it on what is to be the new master. The new master has a new condor_config.

In addition, I went to each client computer and changed the CONDOR_HOST to the new IP.

However, jobs are no longer executing in the pool. With the new master, they were originally exiting with an ExitCode of -1073741515, which is apparently a strange Windows return code. After searching the web for it, I added a condition to trap it, and the jobs are now requeuing but are being evicted.
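
Windows return codes of this kind are usually easier to look up as unsigned 32-bit hexadecimal NTSTATUS values. A minimal Python sketch of the conversion follows; the helper name `to_ntstatus` is made up for illustration and is not part of HTCondor.

```python
def to_ntstatus(exit_code: int) -> str:
    """Reinterpret a signed 32-bit exit code as an unsigned hex value."""
    # Masking with 0xFFFFFFFF maps negative values onto their
    # two's-complement unsigned 32-bit representation.
    return f"0x{exit_code & 0xFFFFFFFF:08X}"


print(to_ntstatus(-1073741515))  # 0xC0000135
```

0xC0000135 is STATUS_DLL_NOT_FOUND, which would suggest that the job's executable could not load a DLL it depends on when started on the execute machine.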

Any idea why they are not running?

Thank you.


The following is the job log for one submission cycle:

000 (002.000.000) 01/25 13:46:41 Job submitted from host: <masterIP:53059>
...
000 (002.001.000) 01/25 13:46:41 Job submitted from host: <masterIP:53059>
...
001 (002.000.000) 01/25 13:46:43 Job executing on host: <clientIP:1064>
...
006 (002.000.000) 01/25 13:46:43 Image size of job updated: 150
	0  -  MemoryUsage of job (MB)
	0  -  ResidentSetSize of job (KB)
...
001 (002.001.000) 01/25 13:46:43 Job executing on host: <clientIP:1064>
...
004 (002.000.000) 01/25 13:46:43 Job was evicted.
	(0) Job terminated and was requeued
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
	0  -  Run Bytes Sent By Job
	3076563  -  Run Bytes Received By Job
	(1) Normal termination (return value -1073741515)
The job attribute OnExitRemove expression '( ExitCode != -1073741515 )' evaluated to FALSE
...
006 (002.001.000) 01/25 13:46:44 Image size of job updated: 150
	0  -  MemoryUsage of job (MB)
	0  -  ResidentSetSize of job (KB)
...
004 (002.001.000) 01/25 13:46:44 Job was evicted.
	(0) Job terminated and was requeued
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
	0  -  Run Bytes Sent By Job
	3076563  -  Run Bytes Received By Job
	(1) Normal termination (return value -1073741515)
The job attribute OnExitRemove expression '( ExitCode != -1073741515 )' evaluated to FALSE
...
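
To see at a glance how often each job cycles through execute and evict, the job log can be tallied with a short script. This is only a sketch: the regular expressions are inferred from the excerpt above, not from a specification of the log format, and `summarize` is a hypothetical helper name.

```python
import re
from collections import Counter

# Event header as it appears in the excerpt, for example:
# "004 (002.000.000) 01/25 13:46:43 Job was evicted."
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.\d+\) \d\d/\d\d \d\d:\d\d:\d\d (.*)$"
)
RETVAL_RE = re.compile(r"return value (-?\d+)")


def summarize(log_text):
    """Count event codes and return values per job (cluster.proc)."""
    events = Counter()
    retvals = Counter()
    current_job = None
    for line in log_text.splitlines():
        m = EVENT_RE.match(line)
        if m:
            current_job = f"{int(m.group(2))}.{int(m.group(3))}"
            events[(current_job, m.group(1))] += 1
            continue
        r = RETVAL_RE.search(line)
        if r and current_job is not None:
            # Attribute the return value to the most recent event header.
            retvals[(current_job, int(r.group(1)))] += 1
    return events, retvals


sample = """\
001 (002.000.000) 01/25 13:46:43 Job executing on host: <clientIP:1064>
...
004 (002.000.000) 01/25 13:46:43 Job was evicted.
\t(0) Job terminated and was requeued
\t(1) Normal termination (return value -1073741515)
...
"""
events, retvals = summarize(sample)
print(events[("2.0", "004")])          # number of evictions for job 2.0
print(retvals[("2.0", -1073741515)])   # times job 2.0 returned this code
```

If the eviction count keeps growing with the same return value each time, the requeue condition is simply re-running a job that fails the same way on every attempt.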

The matching section in the ShadowLog on the master:

01/25/13 13:46:41 Locale: English_United States.1252
01/25/13 13:46:41 Setting maximum accepts per cycle 8.
01/25/13 13:46:41 Locale: English_United States.1252
01/25/13 13:46:41 ******************************************************
01/25/13 13:46:41 Setting maximum accepts per cycle 8.
01/25/13 13:46:41 ** condor_shadow (CONDOR_SHADOW) STARTING UP
01/25/13 13:46:41 ******************************************************
01/25/13 13:46:41 ** C:\condor\bin\condor_shadow.exe
01/25/13 13:46:41 ** condor_shadow (CONDOR_SHADOW) STARTING UP
01/25/13 13:46:41 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
01/25/13 13:46:41 ** C:\condor\bin\condor_shadow.exe
01/25/13 13:46:41 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
01/25/13 13:46:41 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
01/25/13 13:46:41 ** $CondorVersion: 7.8.4 Sep 18 2012 BuildID: 64675 $
01/25/13 13:46:41 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
01/25/13 13:46:41 ** $CondorPlatform: x86_64_winnt_6.1 $
01/25/13 13:46:41 ** $CondorVersion: 7.8.4 Sep 18 2012 BuildID: 64675 $
01/25/13 13:46:41 ** PID = 852
01/25/13 13:46:41 ** $CondorPlatform: x86_64_winnt_6.1 $
01/25/13 13:46:41 ** Log last touched 1/25 13:12:39
01/25/13 13:46:41 ** PID = 3308
01/25/13 13:46:41 ******************************************************
01/25/13 13:46:41 ** Log last touched 1/25 13:12:39
01/25/13 13:46:41 Using config source: C:\condor\condor_config
01/25/13 13:46:41 ******************************************************
01/25/13 13:46:41 Using local config sources:
01/25/13 13:46:41 Using config source: C:\condor\condor_config
01/25/13 13:46:41    C:\condor/condor_config.local
01/25/13 13:46:41 Using local config sources:
01/25/13 13:46:41    C:\condor/condor_config.local
01/25/13 13:46:41 DaemonCore: command socket at <masterIP:53106>
01/25/13 13:46:41 DaemonCore: command socket at <masterIP:53107>
01/25/13 13:46:41 DaemonCore: private command socket at <masterIP:53106>
01/25/13 13:46:41 DaemonCore: private command socket at <masterIP:53107>
01/25/13 13:46:41 Setting maximum accepts per cycle 8.
01/25/13 13:46:41 Setting maximum accepts per cycle 8.
01/25/13 13:46:41 Initializing a VANILLA shadow for job 2.1
01/25/13 13:46:41 Initializing a VANILLA shadow for job 2.0
01/25/13 13:46:41 (2.0) (852): Request to run on slot1@clienthost <clientIP:1064> was ACCEPTED
01/25/13 13:46:41 (2.1) (3308): Request to run on slot2@clienthost <clientIP:1064> was ACCEPTED
01/25/13 13:46:41 (2.0) (852): my_popen: CreateProcess failed
01/25/13 13:46:41 (2.0) (852): FILETRANSFER: Failed to execute C:\condor/bin/curl_plugin, ignoring
01/25/13 13:46:41 (2.0) (852): FILETRANSFER: failed to add plugin "C:\condor/bin/curl_plugin" because: FILETRANSFER:1:Failed to execute C:\condor/bin/curl_plugin, ignoring
01/25/13 13:46:41 (2.1) (3308): my_popen: CreateProcess failed
01/25/13 13:46:41 (2.1) (3308): FILETRANSFER: Failed to execute C:\condor/bin/curl_plugin, ignoring
01/25/13 13:46:41 (2.1) (3308): FILETRANSFER: failed to add plugin "C:\condor/bin/curl_plugin" because: FILETRANSFER:1:Failed to execute C:\condor/bin/curl_plugin, ignoring
01/25/13 13:46:43 (2.0) (852): Job 2.0 is being put back in the job queue: The job attribute OnExitRemove expression '( ExitCode != -1073741515 )' evaluated to FALSE
01/25/13 13:46:43 (2.0) (852): **** condor_shadow (condor_SHADOW) pid 852 EXITING WITH STATUS 107
01/25/13 13:46:44 (2.1) (3308): Job 2.1 is being put back in the job queue: The job attribute OnExitRemove expression '( ExitCode != -1073741515 )' evaluated to FALSE
01/25/13 13:46:44 (2.1) (3308): **** condor_shadow (condor_SHADOW) pid 3308 EXITING WITH STATUS 107
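
Because two shadows (PIDs 852 and 3308) write to the same ShadowLog, their lines are interleaved. A small sketch for separating them by the "(job) (pid):" prefix is shown here; the pattern is inferred from the lines above, lines without that prefix (such as the startup banner) are skipped, and `group_by_job` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Matches lines such as:
# "01/25/13 13:46:41 (2.0) (852): my_popen: CreateProcess failed"
PREFIX_RE = re.compile(
    r"^(\d\d/\d\d/\d\d \d\d:\d\d:\d\d) \((\d+\.\d+)\) \((\d+)\): (.*)$"
)


def group_by_job(lines):
    """Group ShadowLog lines by (job id, shadow pid), preserving order."""
    grouped = defaultdict(list)
    for line in lines:
        m = PREFIX_RE.match(line)
        if m:
            timestamp, job_id, pid, message = m.groups()
            grouped[(job_id, int(pid))].append((timestamp, message))
    return dict(grouped)


shadow_lines = [
    "01/25/13 13:46:41 (2.0) (852): my_popen: CreateProcess failed",
    "01/25/13 13:46:41 (2.1) (3308): my_popen: CreateProcess failed",
    "01/25/13 13:46:43 (2.0) (852): **** condor_shadow (condor_SHADOW) pid 852 EXITING WITH STATUS 107",
]
grouped = group_by_job(shadow_lines)
for (job_id, pid), entries in grouped.items():
    print(job_id, pid)
    for timestamp, message in entries:
        print("   ", timestamp, message)
```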