
Re: [HTCondor-users] jobs not executing



I get those types of exit codes when one of my executables throws an exception, is missing a required DLL, etc. I know your worker nodes didn't change much, but if this were me, I would emulate the job without Condor on one of the nodes: create a folder called "condorTest" on one of the worker nodes, put the exe in there along with any transfer input files you might have used (and no other files), open a command window/terminal, and enter "yourprogram argument1 argument2...". If it runs through fine, you have narrowed it down to some sort of configuration issue; if it doesn't, your configuration isn't the main problem.
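
If you would rather script that manual test, here is a minimal sketch in Python. It assumes Python is available on the worker node; the function name run_in_clean_dir and the scratch-folder prefix are made up for illustration and are not part of HTCondor. It copies the executable and any input files into an otherwise empty folder, runs the program there, and returns its exit code.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def run_in_clean_dir(exe, args=(), input_files=()):
    """Run exe from an otherwise empty scratch folder and return its exit code.

    Hypothetical helper: it mimics the manual "condorTest" folder test,
    it does not reproduce how HTCondor itself launches a job.
    """
    with tempfile.TemporaryDirectory(prefix="condorTest_") as scratch:
        scratch = Path(scratch)
        # Copy the executable so nothing next to the original can help it.
        local_exe = scratch / Path(exe).name
        shutil.copy2(exe, local_exe)
        # Copy only the files the job would have received as transfer input.
        for name in input_files:
            shutil.copy2(name, scratch / Path(name).name)
        # Run with the scratch folder as the working directory.
        result = subprocess.run([str(local_exe), *args], cwd=scratch)
        return result.returncode
```

Call it as run_in_clean_dir("yourprogram.exe", ["argument1", "argument2"], ["input.dat"]), where the file names are placeholders. On Windows, Python reports the exit code as an unsigned number, so a failure may show up as a large positive value rather than the negative one in the job log.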

Mike


On Mon, Jan 28, 2013 at 7:41 AM, <brad.32@xxxxxxxxxxx> wrote:
These are some of the parameters in the master condor_config:

CONDOR_HOST=$(FULL_HOSTNAME)
ALLOW_ADMINISTRATOR=$(IP_ADDRESS)
ALLOW_READ=*
ALLOW_WRITE=*
START=FALSE

Clients:
CONDOR_HOST=master ip
ALLOW_ADMINISTRATOR=master ip
ALLOW_READ=*
ALLOW_WRITE=*
START=TRUE

Anything else that would be helpful to know about this configuration?

Quoting brad.32

I'm trying to move the HTCondor master from one Windows 7 computer to another and still use the computers that were in the original pool, so I uninstalled HTCondor on both of these Win7 machines and reinstalled it on what is to be the new master. The new master has a new condor_config.

In addition, I went to each client computer and changed the CONDOR_HOST to the new IP.

However, jobs are no longer executing in the pool. Using this new master, they were originally exiting with an ExitCode of -1073741515, which is apparently a strange Windows return code. After checking the web for that, I added a condition to trap it, and the jobs are now requeuing but are being evicted.
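
A note on reading that exit code: Windows reports NTSTATUS values as unsigned 32-bit numbers, and HTCondor prints the same bits as a signed integer. The sketch below (Python, with hypothetical helper names to_ntstatus and describe) converts the signed value to hex and looks it up in a small hand-written table of common NTSTATUS codes. The table is deliberately incomplete.

```python
# A few well-known NTSTATUS values; not an exhaustive list.
KNOWN_NTSTATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC000007B: "STATUS_INVALID_IMAGE_FORMAT",
    0xC0000135: "STATUS_DLL_NOT_FOUND",
    0xC0000139: "STATUS_ENTRYPOINT_NOT_FOUND",
}


def to_ntstatus(exit_code):
    """Return the exit code as an unsigned 32-bit hex string."""
    # Masking with 0xFFFFFFFF maps a negative value to its 32-bit
    # two's-complement form and leaves an unsigned value unchanged.
    return f"0x{exit_code & 0xFFFFFFFF:08X}"


def describe(exit_code):
    """Look up the symbolic name, or report that the table lacks it."""
    return KNOWN_NTSTATUS.get(exit_code & 0xFFFFFFFF, "not in this table")


print(to_ntstatus(-1073741515), describe(-1073741515))
```

For the value in this thread, that gives 0xC0000135, which is STATUS_DLL_NOT_FOUND: the program could not load a DLL it depends on.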

Any idea why they are not running?

Thank you.


The following is the job log for one submission cycle:

000 (002.000.000) 01/25 13:46:41 Job submitted from host: <masterIP:53059>
...
000 (002.001.000) 01/25 13:46:41 Job submitted from host: <masterIP:53059>
...
001 (002.000.000) 01/25 13:46:43 Job executing on host: <clientIP:1064>
...
006 (002.000.000) 01/25 13:46:43 Image size of job updated: 150
        0  -  MemoryUsage of job (MB)
        0  -  ResidentSetSize of job (KB)
...
001 (002.001.000) 01/25 13:46:43 Job executing on host: <clientIP:1064>
...
004 (002.000.000) 01/25 13:46:43 Job was evicted.
        (0) Job terminated and was requeued
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        0  -  Run Bytes Sent By Job
        3076563  -  Run Bytes Received By Job
        (1) Normal termination (return value -1073741515)
        The job attribute OnExitRemove expression '( ExitCode != -1073741515 )' evaluated to FALSE
...
006 (002.001.000) 01/25 13:46:44 Image size of job updated: 150
        0  -  MemoryUsage of job (MB)
        0  -  ResidentSetSize of job (KB)
...
004 (002.001.000) 01/25 13:46:44 Job was evicted.
        (0) Job terminated and was requeued
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        0  -  Run Bytes Sent By Job
        3076563  -  Run Bytes Received By Job
        (1) Normal termination (return value -1073741515)
        The job attribute OnExitRemove expression '( ExitCode != -1073741515 )' evaluated to FALSE
...
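
For longer job logs, the return values can be collected per job with a short script. This is only a sketch: the two regular expressions are assumptions based on the event layout shown in the excerpt above, not on a formal definition of the log format, and the function name return_values_by_job is made up.

```python
import re
from collections import defaultdict

# Event header as it appears above, e.g. "004 (002.000.000) 01/25 13:46:43 ..."
EVENT_HEADER = re.compile(r"^(\d{3}) \((\d+\.\d+\.\d+)\)")
# Detail line, e.g. "(1) Normal termination (return value -1073741515)"
RETURN_VALUE = re.compile(r"return value (-?\d+)")


def return_values_by_job(log_text):
    """Map each job ID to the list of return values logged for it."""
    results = defaultdict(list)
    current_job = None
    for line in log_text.splitlines():
        header = EVENT_HEADER.match(line)
        if header:
            # Detail lines that follow belong to this job.
            current_job = header.group(2)
            continue
        found = RETURN_VALUE.search(line)
        if found and current_job is not None:
            results[current_job].append(int(found.group(1)))
    return dict(results)


SAMPLE = """\
004 (002.000.000) 01/25 13:46:43 Job was evicted.
        (0) Job terminated and was requeued
        (1) Normal termination (return value -1073741515)
...
004 (002.001.000) 01/25 13:46:44 Job was evicted.
        (0) Job terminated and was requeued
        (1) Normal termination (return value -1073741515)
...
"""

print(return_values_by_job(SAMPLE))
```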

The matching section in the ShadowLog on the master:

01/25/13 13:46:41 Locale: English_United States.1252
01/25/13 13:46:41 Setting maximum accepts per cycle 8.
01/25/13 13:46:41 Locale: English_United States.1252
01/25/13 13:46:41 ******************************************************
01/25/13 13:46:41 Setting maximum accepts per cycle 8.
01/25/13 13:46:41 ** condor_shadow (CONDOR_SHADOW) STARTING UP
01/25/13 13:46:41 ******************************************************
01/25/13 13:46:41 ** C:\condor\bin\condor_shadow.exe
01/25/13 13:46:41 ** condor_shadow (CONDOR_SHADOW) STARTING UP
01/25/13 13:46:41 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
01/25/13 13:46:41 ** C:\condor\bin\condor_shadow.exe
01/25/13 13:46:41 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
01/25/13 13:46:41 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
01/25/13 13:46:41 ** $CondorVersion: 7.8.4 Sep 18 2012 BuildID: 64675 $
01/25/13 13:46:41 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
01/25/13 13:46:41 ** $CondorPlatform: x86_64_winnt_6.1 $
01/25/13 13:46:41 ** $CondorVersion: 7.8.4 Sep 18 2012 BuildID: 64675 $
01/25/13 13:46:41 ** PID = 852
01/25/13 13:46:41 ** $CondorPlatform: x86_64_winnt_6.1 $
01/25/13 13:46:41 ** Log last touched 1/25 13:12:39
01/25/13 13:46:41 ** PID = 3308
01/25/13 13:46:41 ******************************************************
01/25/13 13:46:41 ** Log last touched 1/25 13:12:39
01/25/13 13:46:41 Using config source: C:\condor\condor_config
01/25/13 13:46:41 ******************************************************
01/25/13 13:46:41 Using local config sources:
01/25/13 13:46:41 Using config source: C:\condor\condor_config
01/25/13 13:46:41    C:\condor/condor_config.local
01/25/13 13:46:41 Using local config sources:
01/25/13 13:46:41    C:\condor/condor_config.local
01/25/13 13:46:41 DaemonCore: command socket at <masterIP:53106>
01/25/13 13:46:41 DaemonCore: command socket at <masterIP:53107>
01/25/13 13:46:41 DaemonCore: private command socket at <masterIP:53106>
01/25/13 13:46:41 DaemonCore: private command socket at <masterIP:53107>
01/25/13 13:46:41 Setting maximum accepts per cycle 8.
01/25/13 13:46:41 Setting maximum accepts per cycle 8.
01/25/13 13:46:41 Initializing a VANILLA shadow for job 2.1
01/25/13 13:46:41 Initializing a VANILLA shadow for job 2.0
01/25/13 13:46:41 (2.0) (852): Request to run on slot1@clienthost <clientIP:1064> was ACCEPTED
01/25/13 13:46:41 (2.1) (3308): Request to run on slot2@clienthost <clientIP:1064> was ACCEPTED
01/25/13 13:46:41 (2.0) (852): my_popen: CreateProcess failed
01/25/13 13:46:41 (2.0) (852): FILETRANSFER: Failed to execute C:\condor/bin/curl_plugin, ignoring
01/25/13 13:46:41 (2.0) (852): FILETRANSFER: failed to add plugin "C:\condor/bin/curl_plugin" because: FILETRANSFER:1:Failed to execute C:\condor/bin/curl_plugin, ignoring
01/25/13 13:46:41 (2.1) (3308): my_popen: CreateProcess failed
01/25/13 13:46:41 (2.1) (3308): FILETRANSFER: Failed to execute C:\condor/bin/curl_plugin, ignoring
01/25/13 13:46:41 (2.1) (3308): FILETRANSFER: failed to add plugin "C:\condor/bin/curl_plugin" because: FILETRANSFER:1:Failed to execute C:\condor/bin/curl_plugin, ignoring
01/25/13 13:46:43 (2.0) (852): Job 2.0 is being put back in the job queue: The job attribute OnExitRemove expression '( ExitCode != -1073741515 )' evaluated to FALSE
01/25/13 13:46:43 (2.0) (852): **** condor_shadow (condor_SHADOW) pid 852 EXITING WITH STATUS 107
01/25/13 13:46:44 (2.1) (3308): Job 2.1 is being put back in the job queue: The job attribute OnExitRemove expression '( ExitCode != -1073741515 )' evaluated to FALSE
01/25/13 13:46:44 (2.1) (3308): **** condor_shadow (condor_SHADOW) pid 3308 EXITING WITH STATUS 107



