Re: [HTCondor-users] jobs not executing



Did you update the ALLOW_READ and ALLOW_WRITE knobs on the client machines as well, so that they accept the new master?
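
As a rough illustration of what that could look like, here is a minimal sketch of the relevant entries in condor_config.local on each execute machine. The address 192.0.2.10 is a placeholder for the new master's IP, and the ALLOW_READ value shown is only an assumption about how open the pool is meant to be; adjust both to your site.

```
# condor_config.local on each execute machine (sketch; values are placeholders)

# Point the machine at the new central manager (hypothetical address).
CONDOR_HOST = 192.0.2.10

# Assumed policy: any host may query status. Tighten this if your pool needs it.
ALLOW_READ = *

# Allow the new master and this machine itself to send updates and start jobs.
ALLOW_WRITE = $(CONDOR_HOST), $(IP_ADDRESS)
```

After editing, running condor_reconfig on the machine should pick up the change, and condor_config_val ALLOW_WRITE shows the value the daemons actually see.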

On Fri, Jan 25, 2013 at 4:53 PM,  <brad.32@xxxxxxxxxxx> wrote:
> I'm trying to move the HTCondor master from one Windows 7 computer to
> another and still use the computers that were in the original pool, so I
> uninstalled HTCondor on both of these Win7 machines and reinstalled it on
> what is to be the new master. The new master has a new condor_config.
>
> In addition, I went to each client computer and changed the CONDOR_HOST to
> the new IP.
>
> However, jobs are not executing in the pool anymore. Using this new master,
> they were originally exiting with an ExitCode of -1073741515, which is
> apparently a strange Windows return code. After checking the web for
> that, I added a condition to trap it, and the jobs are now requeuing but
> are being evicted.
>
> Any idea why they are not running?
>
> Thank you.
>
>
> The following is the job log for one submission cycle:
>
> 000 (002.000.000) 01/25 13:46:41 Job submitted from host: <masterIP:53059>
> ...
> 000 (002.001.000) 01/25 13:46:41 Job submitted from host: <masterIP:53059>
> ...
> 001 (002.000.000) 01/25 13:46:43 Job executing on host: <clientIP:1064>
> ...
> 006 (002.000.000) 01/25 13:46:43 Image size of job updated: 150
>         0  -  MemoryUsage of job (MB)
>         0  -  ResidentSetSize of job (KB)
> ...
> 001 (002.001.000) 01/25 13:46:43 Job executing on host: <clientIP:1064>
> ...
> 004 (002.000.000) 01/25 13:46:43 Job was evicted.
>         (0) Job terminated and was requeued
>                 Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
>                 Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
>         0  -  Run Bytes Sent By Job
>         3076563  -  Run Bytes Received By Job
>         (1) Normal termination (return value -1073741515)
>         The job attribute OnExitRemove expression '( ExitCode != -1073741515
> )' evaluated to FALSE
> ...
> 006 (002.001.000) 01/25 13:46:44 Image size of job updated: 150
>         0  -  MemoryUsage of job (MB)
>         0  -  ResidentSetSize of job (KB)
> ...
> 004 (002.001.000) 01/25 13:46:44 Job was evicted.
>         (0) Job terminated and was requeued
>                 Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
>                 Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
>         0  -  Run Bytes Sent By Job
>         3076563  -  Run Bytes Received By Job
>         (1) Normal termination (return value -1073741515)
>         The job attribute OnExitRemove expression '( ExitCode != -1073741515
> )' evaluated to FALSE
> ...
>
> The matching section in the ShadowLog on the master:
>
> 01/25/13 13:46:41 Locale: English_United States.1252
> 01/25/13 13:46:41 Setting maximum accepts per cycle 8.
> 01/25/13 13:46:41 Locale: English_United States.1252
> 01/25/13 13:46:41 ******************************************************
> 01/25/13 13:46:41 Setting maximum accepts per cycle 8.
> 01/25/13 13:46:41 ** condor_shadow (CONDOR_SHADOW) STARTING UP
> 01/25/13 13:46:41 ******************************************************
> 01/25/13 13:46:41 ** C:\condor\bin\condor_shadow.exe
> 01/25/13 13:46:41 ** condor_shadow (CONDOR_SHADOW) STARTING UP
> 01/25/13 13:46:41 ** SubsystemInfo: name=SHADOW type=SHADOW(6)
> class=DAEMON(1)
> 01/25/13 13:46:41 ** C:\condor\bin\condor_shadow.exe
> 01/25/13 13:46:41 ** Configuration: subsystem:SHADOW local:<NONE>
> class:DAEMON
> 01/25/13 13:46:41 ** SubsystemInfo: name=SHADOW type=SHADOW(6)
> class=DAEMON(1)
> 01/25/13 13:46:41 ** $CondorVersion: 7.8.4 Sep 18 2012 BuildID: 64675 $
> 01/25/13 13:46:41 ** Configuration: subsystem:SHADOW local:<NONE>
> class:DAEMON
> 01/25/13 13:46:41 ** $CondorPlatform: x86_64_winnt_6.1 $
> 01/25/13 13:46:41 ** $CondorVersion: 7.8.4 Sep 18 2012 BuildID: 64675 $
> 01/25/13 13:46:41 ** PID = 852
> 01/25/13 13:46:41 ** $CondorPlatform: x86_64_winnt_6.1 $
> 01/25/13 13:46:41 ** Log last touched 1/25 13:12:39
> 01/25/13 13:46:41 ** PID = 3308
> 01/25/13 13:46:41 ******************************************************
> 01/25/13 13:46:41 ** Log last touched 1/25 13:12:39
> 01/25/13 13:46:41 Using config source: C:\condor\condor_config
> 01/25/13 13:46:41 ******************************************************
> 01/25/13 13:46:41 Using local config sources:
> 01/25/13 13:46:41 Using config source: C:\condor\condor_config
> 01/25/13 13:46:41    C:\condor/condor_config.local
> 01/25/13 13:46:41 Using local config sources:
> 01/25/13 13:46:41    C:\condor/condor_config.local
> 01/25/13 13:46:41 DaemonCore: command socket at <masterIP:53106>
> 01/25/13 13:46:41 DaemonCore: command socket at <masterIP:53107>
> 01/25/13 13:46:41 DaemonCore: private command socket at <masterIP:53106>
> 01/25/13 13:46:41 DaemonCore: private command socket at <masterIP:53107>
> 01/25/13 13:46:41 Setting maximum accepts per cycle 8.
> 01/25/13 13:46:41 Setting maximum accepts per cycle 8.
> 01/25/13 13:46:41 Initializing a VANILLA shadow for job 2.1
> 01/25/13 13:46:41 Initializing a VANILLA shadow for job 2.0
> 01/25/13 13:46:41 (2.0) (852): Request to run on slot1@clienthost
> <clientIP:1064> was ACCEPTED
> 01/25/13 13:46:41 (2.1) (3308): Request to run on slot2@clienthost
> <clientIP:1064> was ACCEPTED
> 01/25/13 13:46:41 (2.0) (852): my_popen: CreateProcess failed
> 01/25/13 13:46:41 (2.0) (852): FILETRANSFER: Failed to execute
> C:\condor/bin/curl_plugin, ignoring
> 01/25/13 13:46:41 (2.0) (852): FILETRANSFER: failed to add plugin
> "C:\condor/bin/curl_plugin" because: FILETRANSFER:1:Failed to execute
> C:\condor/bin/curl_plugin, ignoring
> 01/25/13 13:46:41 (2.1) (3308): my_popen: CreateProcess failed
> 01/25/13 13:46:41 (2.1) (3308): FILETRANSFER: Failed to execute
> C:\condor/bin/curl_plugin, ignoring
> 01/25/13 13:46:41 (2.1) (3308): FILETRANSFER: failed to add plugin
> "C:\condor/bin/curl_plugin" because: FILETRANSFER:1:Failed to execute
> C:\condor/bin/curl_plugin, ignoring
> 01/25/13 13:46:43 (2.0) (852): Job 2.0 is being put back in the job queue:
> The job attribute OnExitRemove expression '( ExitCode != -1073741515 )'
> evaluated to FALSE
> 01/25/13 13:46:43 (2.0) (852): **** condor_shadow (condor_SHADOW) pid 852
> EXITING WITH STATUS 107
> 01/25/13 13:46:44 (2.1) (3308): Job 2.1 is being put back in the job queue:
> The job attribute OnExitRemove expression '( ExitCode != -1073741515 )'
> evaluated to FALSE
> 01/25/13 13:46:44 (2.1) (3308): **** condor_shadow (condor_SHADOW) pid 3308
> EXITING WITH STATUS 107



-- 
Condor Project Windows Developer