
Re: [HTCondor-users] jobs not executing



Yes, the problem was a DLL that I was not copying over with the executable. I had updated Postgres as part of the switch, and that was my mistake: one of the required DLLs had changed names. I didn't know that, and I had expected such a problem to show up in the error log.

The jobs seem to be working now, and on all the worker nodes.

Thank you for the suggestion.
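
For reference, the ExitCode of -1073741515 quoted further down this thread is the signed 32-bit form of the Windows status code 0xC0000135 (STATUS_DLL_NOT_FOUND), which is consistent with the missing-DLL diagnosis. A minimal Python sketch of that conversion follows; the function names and the small lookup table are illustrative assumptions, not part of HTCondor.

```python
# Hand-picked subset of Windows NTSTATUS values, for illustration only.
NTSTATUS_NAMES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC0000135: "STATUS_DLL_NOT_FOUND",
    0xC0000139: "STATUS_ENTRYPOINT_NOT_FOUND",
}


def to_unsigned32(exit_code: int) -> int:
    """Reinterpret a signed 32-bit exit code as an unsigned 32-bit value."""
    return exit_code & 0xFFFFFFFF


def describe_exit_code(exit_code: int) -> str:
    """Format an exit code as hex, with a name if it is in the table above."""
    value = to_unsigned32(exit_code)
    name = NTSTATUS_NAMES.get(value, "unknown")
    return f"0x{value:08X} ({name})"


if __name__ == "__main__":
    # The value reported in the job log in this thread.
    print(describe_exit_code(-1073741515))  # 0xC0000135 (STATUS_DLL_NOT_FOUND)
```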


Quoting "Michael Aschenbeck" <m.g.aschenbeck@xxxxxxxxx>:

I get that type of exit code when one of my executables throws an
exception, doesn't have the required DLLs, etc.  I know your worker nodes
didn't change much, but if this were me I would emulate the job without
condor on one of the nodes, i.e. create a folder called "condorTest" on
one of the worker nodes, put the exe in there, put any transfer input files
you might have used there (and no other files), open a command
window/terminal and enter "yourprogram argument1 argument2...".  If it runs
through fine, you have narrowed it down to some sort of configuration
issue. If it doesn't, your configuration isn't the main problem.

Mike
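
The manual test described above can also be scripted. The following is a minimal Python sketch, assuming Python is available on the worker node; run_in_clean_dir and the "condorTest_" folder prefix are hypothetical names chosen for illustration. It copies the executable and its input files into an otherwise empty scratch folder, runs it there, and returns the exit code.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def run_in_clean_dir(executable, input_files=(), args=()):
    """Run a copy of `executable` in an empty scratch folder.

    Only the executable and the listed input files are copied in, which
    roughly mimics what a job sees when nothing else is transferred.
    Returns the process exit code.
    """
    with tempfile.TemporaryDirectory(prefix="condorTest_") as scratch:
        scratch_path = Path(scratch)
        exe_copy = scratch_path / Path(executable).name
        shutil.copy2(executable, exe_copy)
        for input_file in input_files:
            shutil.copy2(input_file, scratch_path / Path(input_file).name)
        # Run with the scratch folder as the working directory.
        result = subprocess.run([str(exe_copy), *args], cwd=scratch_path)
        return result.returncode


# Example call (file names are placeholders):
# code = run_in_clean_dir("yourprogram.exe", ["input.dat"], ["argument1"])
# print(code)
```

A nonzero result here points at the executable or its dependencies rather than at the pool configuration. Note that on Windows Python may report the code in its unsigned form, and that a DLL installed elsewhere on the machine's search path can still be found, so a clean run in this folder does not by itself prove the transferred file set is complete.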


On Mon, Jan 28, 2013 at 7:41 AM, <brad.32@xxxxxxxxxxx> wrote:

These are some of the parameters in the master condor_config:

CONDOR_HOST=$(FULL_HOSTNAME)
ALLOW_ADMINISTRATOR=$(IP_ADDRESS)
ALLOW_READ=*
ALLOW_WRITE=*
START=FALSE

Clients:
CONDOR_HOST=master ip
ALLOW_ADMINISTRATOR=master ip
ALLOW_READ=*
ALLOW_WRITE=*
START=TRUE

Anything else that would be helpful to know about this configuration?

Quoting brad.32

I'm trying to move the HTCondor master from one Windows 7 computer to
another and still use the computers that were in the original pool, so I
uninstalled HTCondor on both of these Win7 machines and reinstalled it on
what is to be the new master. The new master has a new condor_config.

In addition, I went to each client computer and changed the CONDOR_HOST
to the new IP.

However, jobs are not executing in the pool anymore. Using this new
master, they were originally exiting with an ExitCode of -1073741515, which
is apparently a strange Windows return code. After checking the web for
that, I added a condition to trap it, and the jobs are now requeuing but
are being evicted.

Any idea why they are not running?

Thank you.


The following is the job log for one submission cycle:

000 (002.000.000) 01/25 13:46:41 Job submitted from host: <masterIP:53059>
...
000 (002.001.000) 01/25 13:46:41 Job submitted from host: <masterIP:53059>
...
001 (002.000.000) 01/25 13:46:43 Job executing on host: <clientIP:1064>
...
006 (002.000.000) 01/25 13:46:43 Image size of job updated: 150
        0  -  MemoryUsage of job (MB)
        0  -  ResidentSetSize of job (KB)
...
001 (002.001.000) 01/25 13:46:43 Job executing on host: <clientIP:1064>
...
004 (002.000.000) 01/25 13:46:43 Job was evicted.
        (0) Job terminated and was requeued
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        0  -  Run Bytes Sent By Job
        3076563  -  Run Bytes Received By Job
        (1) Normal termination (return value -1073741515)
        The job attribute OnExitRemove expression '( ExitCode != -1073741515 )' evaluated to FALSE
...
006 (002.001.000) 01/25 13:46:44 Image size of job updated: 150
        0  -  MemoryUsage of job (MB)
        0  -  ResidentSetSize of job (KB)
...
004 (002.001.000) 01/25 13:46:44 Job was evicted.
        (0) Job terminated and was requeued
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        0  -  Run Bytes Sent By Job
        3076563  -  Run Bytes Received By Job
        (1) Normal termination (return value -1073741515)
        The job attribute OnExitRemove expression '( ExitCode != -1073741515 )' evaluated to FALSE
...
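
A log like the one above can be summarized per job with a few lines of Python. This is a rough sketch that assumes the event layout shown in this message (a three-digit event number followed by "(cluster.proc.subproc)" on the header line, and "return value N" on a detail line); return_values_by_job is a hypothetical helper, not an HTCondor tool.

```python
import re
from collections import defaultdict

# Header line of an event, e.g. "004 (002.000.000) 01/25 13:46:43 Job was evicted."
EVENT_HEADER = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.\d+\) ")
# Detail line, e.g. "(1) Normal termination (return value -1073741515)"
RETURN_VALUE = re.compile(r"return value (-?\d+)")


def return_values_by_job(log_text):
    """Map 'cluster.proc' to the list of return values logged for that job."""
    results = defaultdict(list)
    current_job = None
    for line in log_text.splitlines():
        header = EVENT_HEADER.match(line)
        if header:
            # Remember which job the following detail lines belong to.
            current_job = f"{int(header.group(2))}.{int(header.group(3))}"
            continue
        found = RETURN_VALUE.search(line)
        if found and current_job is not None:
            results[current_job].append(int(found.group(1)))
    return dict(results)


if __name__ == "__main__":
    sample = """004 (002.000.000) 01/25 13:46:43 Job was evicted.
        (0) Job terminated and was requeued
        (1) Normal termination (return value -1073741515)
...
004 (002.001.000) 01/25 13:46:44 Job was evicted.
        (1) Normal termination (return value -1073741515)
...
"""
    print(return_values_by_job(sample))
```

Seeing the same return value for every job on every requeue, as here, suggests a problem with the executable itself rather than with a particular slot or machine.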

The matching section in the ShadowLog on the master:

01/25/13 13:46:41 Locale: English_United States.1252
01/25/13 13:46:41 Setting maximum accepts per cycle 8.
01/25/13 13:46:41 Locale: English_United States.1252
01/25/13 13:46:41 ********************************************************
01/25/13 13:46:41 Setting maximum accepts per cycle 8.
01/25/13 13:46:41 ** condor_shadow (CONDOR_SHADOW) STARTING UP
01/25/13 13:46:41 ********************************************************
01/25/13 13:46:41 ** C:\condor\bin\condor_shadow.exe
01/25/13 13:46:41 ** condor_shadow (CONDOR_SHADOW) STARTING UP
01/25/13 13:46:41 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
01/25/13 13:46:41 ** C:\condor\bin\condor_shadow.exe
01/25/13 13:46:41 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
01/25/13 13:46:41 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
01/25/13 13:46:41 ** $CondorVersion: 7.8.4 Sep 18 2012 BuildID: 64675 $
01/25/13 13:46:41 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
01/25/13 13:46:41 ** $CondorPlatform: x86_64_winnt_6.1 $
01/25/13 13:46:41 ** $CondorVersion: 7.8.4 Sep 18 2012 BuildID: 64675 $
01/25/13 13:46:41 ** PID = 852
01/25/13 13:46:41 ** $CondorPlatform: x86_64_winnt_6.1 $
01/25/13 13:46:41 ** Log last touched 1/25 13:12:39
01/25/13 13:46:41 ** PID = 3308
01/25/13 13:46:41 ********************************************************
01/25/13 13:46:41 ** Log last touched 1/25 13:12:39
01/25/13 13:46:41 Using config source: C:\condor\condor_config
01/25/13 13:46:41 ********************************************************
01/25/13 13:46:41 Using local config sources:
01/25/13 13:46:41 Using config source: C:\condor\condor_config
01/25/13 13:46:41    C:\condor/condor_config.local
01/25/13 13:46:41 Using local config sources:
01/25/13 13:46:41    C:\condor/condor_config.local
01/25/13 13:46:41 DaemonCore: command socket at <masterIP:53106>
01/25/13 13:46:41 DaemonCore: command socket at <masterIP:53107>
01/25/13 13:46:41 DaemonCore: private command socket at <masterIP:53106>
01/25/13 13:46:41 DaemonCore: private command socket at <masterIP:53107>
01/25/13 13:46:41 Setting maximum accepts per cycle 8.
01/25/13 13:46:41 Setting maximum accepts per cycle 8.
01/25/13 13:46:41 Initializing a VANILLA shadow for job 2.1
01/25/13 13:46:41 Initializing a VANILLA shadow for job 2.0
01/25/13 13:46:41 (2.0) (852): Request to run on slot1@clienthost<clientIP:1064> was ACCEPTED
01/25/13 13:46:41 (2.1) (3308): Request to run on slot2@clienthost<clientIP:1064> was ACCEPTED
01/25/13 13:46:41 (2.0) (852): my_popen: CreateProcess failed
01/25/13 13:46:41 (2.0) (852): FILETRANSFER: Failed to execute C:\condor/bin/curl_plugin, ignoring
01/25/13 13:46:41 (2.0) (852): FILETRANSFER: failed to add plugin "C:\condor/bin/curl_plugin" because: FILETRANSFER:1:Failed to execute C:\condor/bin/curl_plugin, ignoring
01/25/13 13:46:41 (2.1) (3308): my_popen: CreateProcess failed
01/25/13 13:46:41 (2.1) (3308): FILETRANSFER: Failed to execute C:\condor/bin/curl_plugin, ignoring
01/25/13 13:46:41 (2.1) (3308): FILETRANSFER: failed to add plugin "C:\condor/bin/curl_plugin" because: FILETRANSFER:1:Failed to execute C:\condor/bin/curl_plugin, ignoring
01/25/13 13:46:43 (2.0) (852): Job 2.0 is being put back in the job queue: The job attribute OnExitRemove expression '( ExitCode != -1073741515 )' evaluated to FALSE
01/25/13 13:46:43 (2.0) (852): **** condor_shadow (condor_SHADOW) pid 852 EXITING WITH STATUS 107
01/25/13 13:46:44 (2.1) (3308): Job 2.1 is being put back in the job queue: The job attribute OnExitRemove expression '( ExitCode != -1073741515 )' evaluated to FALSE
01/25/13 13:46:44 (2.1) (3308): **** condor_shadow (condor_SHADOW) pid 3308 EXITING WITH STATUS 107



