
Re: [HTCondor-users] jobs not executing



On 25-Jan-13 20:21, Ziliang Guo wrote:
Did you update the ALLOW_READ/WRITE knobs as well?
Those two were/are set to "*".

On Fri, Jan 25, 2013 at 4:53 PM,  <brad.32@xxxxxxxxxxx> wrote:
I'm trying to move the HTCondor master from one Windows 7 computer to
another and still use the computers that were in the original pool, so I
uninstalled HTCondor on both of these Win7 machines and reinstalled it on
what is to be the new master. The new master has a new condor_config.

In addition, I went to each client computer and changed the CONDOR_HOST to
the new IP.

However, jobs are not executing in the pool anymore. Using this new master
they were originally exiting with an ExitCode of -1073741515, which is
apparently a strange Windows return code, but after checking the web for
that, I added a condition to trap it, and the jobs are requeuing but
are now being evicted.
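
For reference, the return value is easier to look up once it is read as an
unsigned 32-bit number: -1073741515 becomes 0xC0000135, which Windows lists
as STATUS_DLL_NOT_FOUND, so the job executable may be failing to find a DLL
on the execute machine. A minimal sketch of that conversion follows. The
helper names and the one-entry lookup table are illustrative and are not
part of HTCondor.

```python
# Reinterpret a signed 32-bit Windows exit code as an unsigned NTSTATUS value.

# Partial, hand-written table: only the code seen in this thread is listed.
KNOWN_STATUS = {
    0xC0000135: "STATUS_DLL_NOT_FOUND",
}


def to_ntstatus(exit_code: int) -> int:
    """Mask the exit code to 32 bits so negative values become unsigned."""
    return exit_code & 0xFFFFFFFF


def describe(exit_code: int) -> str:
    """Format the exit code as hex, with a name when the table has one."""
    status = to_ntstatus(exit_code)
    name = KNOWN_STATUS.get(status, "unknown")
    return f"0x{status:08X} ({name})"


if __name__ == "__main__":
    print(describe(-1073741515))  # prints: 0xC0000135 (STATUS_DLL_NOT_FOUND)
```

If that reading is right, trapping the code in OnExitRemove only makes the
job requeue; it would keep failing the same way until the missing DLL is
available on the client.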

Any idea why they are not running?

Thank you.


The following is the job log for one submission cycle:

000 (002.000.000) 01/25 13:46:41 Job submitted from host: <masterIP:53059>
...
000 (002.001.000) 01/25 13:46:41 Job submitted from host: <masterIP:53059>
...
001 (002.000.000) 01/25 13:46:43 Job executing on host: <clientIP:1064>
...
006 (002.000.000) 01/25 13:46:43 Image size of job updated: 150
         0  -  MemoryUsage of job (MB)
         0  -  ResidentSetSize of job (KB)
...
001 (002.001.000) 01/25 13:46:43 Job executing on host: <clientIP:1064>
...
004 (002.000.000) 01/25 13:46:43 Job was evicted.
         (0) Job terminated and was requeued
                 Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                 Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
         0  -  Run Bytes Sent By Job
         3076563  -  Run Bytes Received By Job
         (1) Normal termination (return value -1073741515)
         The job attribute OnExitRemove expression '( ExitCode != -1073741515
)' evaluated to FALSE
...
006 (002.001.000) 01/25 13:46:44 Image size of job updated: 150
         0  -  MemoryUsage of job (MB)
         0  -  ResidentSetSize of job (KB)
...
004 (002.001.000) 01/25 13:46:44 Job was evicted.
         (0) Job terminated and was requeued
                 Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                 Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
         0  -  Run Bytes Sent By Job
         3076563  -  Run Bytes Received By Job
         (1) Normal termination (return value -1073741515)
         The job attribute OnExitRemove expression '( ExitCode != -1073741515
)' evaluated to FALSE
...
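
With more than a couple of jobs, the return values in a log like the one
above can be pulled out with a short script. This is only a sketch that
assumes the layout shown here: a header line with a three-digit event code
and a (cluster.proc.subproc) id, indented detail lines, and "..." closing
each event. The function names are made up for illustration.

```python
import re

# Header layout as it appears in the log excerpt above.
EVENT_HEADER = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d\d/\d\d \d\d:\d\d:\d\d) (.*)$"
)
RETURN_VALUE = re.compile(r"return value (-?\d+)")

SAMPLE = """\
001 (002.000.000) 01/25 13:46:43 Job executing on host: <clientIP:1064>
...
004 (002.000.000) 01/25 13:46:43 Job was evicted.
         (0) Job terminated and was requeued
         (1) Normal termination (return value -1073741515)
...
"""


def parse_events(text):
    """Split job-log text into events; each event ends at a '...' line."""
    events, current = [], None
    for line in text.splitlines():
        if line.strip() == "...":
            if current is not None:
                events.append(current)
                current = None
            continue
        match = EVENT_HEADER.match(line)
        if match and current is None:
            current = {
                "code": match.group(1),
                "job": f"{int(match.group(2))}.{int(match.group(3))}",
                "time": match.group(5),
                "body": [match.group(6)],
            }
        elif current is not None:
            current["body"].append(line.strip())
    if current is not None:
        events.append(current)
    return events


def eviction_return_values(events):
    """Map job id to the return values reported in its 004 (evicted) events."""
    found = {}
    for event in events:
        if event["code"] != "004":
            continue
        match = RETURN_VALUE.search(" ".join(event["body"]))
        if match:
            found.setdefault(event["job"], []).append(int(match.group(1)))
    return found


if __name__ == "__main__":
    print(eviction_return_values(parse_events(SAMPLE)))
```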

The matching section in the ShadowLog on the master:

01/25/13 13:46:41 Locale: English_United States.1252
01/25/13 13:46:41 Setting maximum accepts per cycle 8.
01/25/13 13:46:41 Locale: English_United States.1252
01/25/13 13:46:41 ******************************************************
01/25/13 13:46:41 Setting maximum accepts per cycle 8.
01/25/13 13:46:41 ** condor_shadow (CONDOR_SHADOW) STARTING UP
01/25/13 13:46:41 ******************************************************
01/25/13 13:46:41 ** C:\condor\bin\condor_shadow.exe
01/25/13 13:46:41 ** condor_shadow (CONDOR_SHADOW) STARTING UP
01/25/13 13:46:41 ** SubsystemInfo: name=SHADOW type=SHADOW(6)
class=DAEMON(1)
01/25/13 13:46:41 ** C:\condor\bin\condor_shadow.exe
01/25/13 13:46:41 ** Configuration: subsystem:SHADOW local:<NONE>
class:DAEMON
01/25/13 13:46:41 ** SubsystemInfo: name=SHADOW type=SHADOW(6)
class=DAEMON(1)
01/25/13 13:46:41 ** $CondorVersion: 7.8.4 Sep 18 2012 BuildID: 64675 $
01/25/13 13:46:41 ** Configuration: subsystem:SHADOW local:<NONE>
class:DAEMON
01/25/13 13:46:41 ** $CondorPlatform: x86_64_winnt_6.1 $
01/25/13 13:46:41 ** $CondorVersion: 7.8.4 Sep 18 2012 BuildID: 64675 $
01/25/13 13:46:41 ** PID = 852
01/25/13 13:46:41 ** $CondorPlatform: x86_64_winnt_6.1 $
01/25/13 13:46:41 ** Log last touched 1/25 13:12:39
01/25/13 13:46:41 ** PID = 3308
01/25/13 13:46:41 ******************************************************
01/25/13 13:46:41 ** Log last touched 1/25 13:12:39
01/25/13 13:46:41 Using config source: C:\condor\condor_config
01/25/13 13:46:41 ******************************************************
01/25/13 13:46:41 Using local config sources:
01/25/13 13:46:41 Using config source: C:\condor\condor_config
01/25/13 13:46:41    C:\condor/condor_config.local
01/25/13 13:46:41 Using local config sources:
01/25/13 13:46:41    C:\condor/condor_config.local
01/25/13 13:46:41 DaemonCore: command socket at <masterIP:53106>
01/25/13 13:46:41 DaemonCore: command socket at <masterIP:53107>
01/25/13 13:46:41 DaemonCore: private command socket at <masterIP:53106>
01/25/13 13:46:41 DaemonCore: private command socket at <masterIP:53107>
01/25/13 13:46:41 Setting maximum accepts per cycle 8.
01/25/13 13:46:41 Setting maximum accepts per cycle 8.
01/25/13 13:46:41 Initializing a VANILLA shadow for job 2.1
01/25/13 13:46:41 Initializing a VANILLA shadow for job 2.0
01/25/13 13:46:41 (2.0) (852): Request to run on slot1@clienthost
<clientIP:1064> was ACCEPTED
01/25/13 13:46:41 (2.1) (3308): Request to run on slot2@clienthost
<clientIP:1064> was ACCEPTED
01/25/13 13:46:41 (2.0) (852): my_popen: CreateProcess failed
01/25/13 13:46:41 (2.0) (852): FILETRANSFER: Failed to execute
C:\condor/bin/curl_plugin, ignoring
01/25/13 13:46:41 (2.0) (852): FILETRANSFER: failed to add plugin
"C:\condor/bin/curl_plugin" because: FILETRANSFER:1:Failed to execute
C:\condor/bin/curl_plugin, ignoring
01/25/13 13:46:41 (2.1) (3308): my_popen: CreateProcess failed
01/25/13 13:46:41 (2.1) (3308): FILETRANSFER: Failed to execute
C:\condor/bin/curl_plugin, ignoring
01/25/13 13:46:41 (2.1) (3308): FILETRANSFER: failed to add plugin
"C:\condor/bin/curl_plugin" because: FILETRANSFER:1:Failed to execute
C:\condor/bin/curl_plugin, ignoring
01/25/13 13:46:43 (2.0) (852): Job 2.0 is being put back in the job queue:
The job attribute OnExitRemove expression '( ExitCode != -1073741515 )'
evaluated to FALSE
01/25/13 13:46:43 (2.0) (852): **** condor_shadow (condor_SHADOW) pid 852
EXITING WITH STATUS 107
01/25/13 13:46:44 (2.1) (3308): Job 2.1 is being put back in the job queue:
The job attribute OnExitRemove expression '( ExitCode != -1073741515 )'
evaluated to FALSE
01/25/13 13:46:44 (2.1) (3308): **** condor_shadow (condor_SHADOW) pid 3308
EXITING WITH STATUS 107
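
The excerpt above is interleaved because two shadows (pids 852 and 3308)
write to the same ShadowLog. A rough way to read one job at a time is to
group the lines that carry a "(cluster.proc) (pid):" tag. The sketch below
assumes that tag format and treats a line without a timestamp as a wrapped
continuation of the previous tagged line. Untagged startup lines cannot be
attributed to a job and are skipped. The function name is illustrative.

```python
import re
from collections import defaultdict

# Lines tagged with a job id and shadow pid, e.g. "... (2.0) (852): message".
TAGGED = re.compile(
    r"^(\d\d/\d\d/\d\d \d\d:\d\d:\d\d) \((\d+\.\d+)\) \((\d+)\): (.*)$"
)
TIMESTAMPED = re.compile(r"^\d\d/\d\d/\d\d \d\d:\d\d:\d\d ")


def split_by_job(lines):
    """Group tagged ShadowLog lines by job id, re-joining wrapped lines."""
    per_job = defaultdict(list)
    last_job = None
    for line in lines:
        match = TAGGED.match(line)
        if match:
            last_job = match.group(2)
            per_job[last_job].append(f"{match.group(1)} {match.group(4)}")
        elif TIMESTAMPED.match(line):
            last_job = None  # untagged startup line, not attributable
        elif last_job is not None and line.strip():
            per_job[last_job][-1] += " " + line.strip()
    return dict(per_job)


if __name__ == "__main__":
    sample = [
        "01/25/13 13:46:41 Setting maximum accepts per cycle 8.",
        "01/25/13 13:46:41 (2.0) (852): my_popen: CreateProcess failed",
        "01/25/13 13:46:41 (2.1) (3308): my_popen: CreateProcess failed",
        "01/25/13 13:46:43 (2.0) (852): **** condor_shadow (condor_SHADOW) pid 852",
        "EXITING WITH STATUS 107",
    ]
    for job, job_lines in split_by_job(sample).items():
        print(job)
        for entry in job_lines:
            print("   ", entry)
```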

