[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Condor-users] condor-g jobs failing - stuck in STAGE_OUT



On Wed, 23 Jun 2004 21:53:51 -0500, Jaime Frey <jfrey@xxxxxxxxxxx> wrote:

It looks like the globus-url-copy to stage out the job files is hanging on the gatekeeper machine. We've seen this problem with globus-url-copy in other situations, but haven't been able to determine the cause. If you could add/change the following line to your condor_config file, reproduce the problem, and send me the resulting gridmanager log file, it'd be a great help in figuring out how to make condor-g better at working around this problem:

GRIDMANAGER_DEBUG = D_FULLDEBUG

Thanks for the reply,


I've been trying to recreate the problem, with full debugging, but I'm noticing two different outputs. In one case, the job is not attempting to restart and netstat shows no connections on the condor-g machine to the remote resource, but the remote resource shows CLOSE_WAIT connections with the condor-g machine. This is the tail end of the GridmanagerLog file for that:

6/24 10:16:47 [22515] Updating classad values for 859.0:
6/24 10:16:47 [22515] GlobusStatus = 128
6/24 10:16:47 [22515] leaving doContactSchedd()
6/24 10:16:52 [22515] DaemonCore::IsPidAlive(): kill returned EPERM, assuming pid 3366 is alive.
6/24 10:17:37 [22515] GAHP <- 'RESULTS'
6/24 10:17:37 [22515] GAHP -> 'S' '0'
6/24 10:17:38 [22515] checkResources(): 1 resources, 0 are down
6/24 10:18:37 [22515] GAHP <- 'RESULTS'
6/24 10:18:37 [22515] GAHP -> 'S' '0'
6/24 10:18:38 [22515] checkResources(): 1 resources, 0 are down
6/24 10:18:52 [22515] DaemonCore::IsPidAlive(): kill returned EPERM, assuming pid 3366 is alive.
6/24 10:19:37 [22515] GAHP <- 'RESULTS'
6/24 10:19:37 [22515] GAHP -> 'S' '0'
6/24 10:19:38 [22515] checkResources(): 1 resources, 0 are down
6/24 10:20:16 [22515] (859.0) doEvaluateState called: gmState GM_SUBMITTED, globusState 128
6/24 10:20:16 [22515] (859.0) gm state change: GM_SUBMITTED -> GM_PROBE_JOBMANAGER
6/24 10:20:16 [22515] GAHP <- 'GRAM_JOB_STATUS 25 https://mercury.uvic.ca:40053/9165/1088097296/'
6/24 10:20:16 [22515] GAHP -> 'S'
6/24 10:20:17 [22515] GAHP <- 'RESULTS'
6/24 10:20:17 [22515] GAHP -> 'S' '1'
6/24 10:20:17 [22515] GAHP -> '25' '0' '0' '128'
6/24 10:20:17 [22515] (859.0) doEvaluateState called: gmState GM_PROBE_JOBMANAGER, globusState 128
6/24 10:20:17 [22515] (859.0) gm state change: GM_PROBE_JOBMANAGER -> GM_SUBMITTED
6/24 10:20:37 [22515] GAHP <- 'RESULTS'
6/24 10:20:37 [22515] GAHP -> 'S' '0'
6/24 10:20:38 [22515] checkResources(): 1 resources, 0 are down
6/24 10:20:52 [22515] DaemonCore::IsPidAlive(): kill returned EPERM, assuming pid 3366 is alive.
6/24 10:21:37 [22515] GAHP <- 'RESULTS'
6/24 10:21:37 [22515] GAHP -> 'S' '0'
6/24 10:21:38 [22515] checkResources(): 1 resources, 0 are down
6/24 10:22:37 [22515] GAHP <- 'RESULTS'
6/24 10:22:37 [22515] GAHP -> 'S' '0'
6/24 10:22:38 [22515] checkResources(): 1 resources, 0 are down
6/24 10:22:52 [22515] DaemonCore::IsPidAlive(): kill returned EPERM, assuming pid 3366 is alive.
6/24 10:23:37 [22515] GAHP <- 'RESULTS'
6/24 10:23:37 [22515] GAHP -> 'S' '0'
6/24 10:23:38 [22515] checkResources(): 1 resources, 0 are down
6/24 10:24:37 [22515] GAHP <- 'RESULTS'
6/24 10:24:37 [22515] GAHP -> 'S' '0'
6/24 10:24:38 [22515] checkResources(): 1 resources, 0 are down
6/24 10:24:40 [22515] CheckProxies called
6/24 10:24:40 [22515] will call CheckProxies again in 600 seconds
6/24 10:24:52 [22515] DaemonCore::IsPidAlive(): kill returned EPERM, assuming pid 3366 is alive.
6/24 10:25:17 [22515] (859.0) doEvaluateState called: gmState GM_SUBMITTED, globusState 128
6/24 10:25:17 [22515] (859.0) gm state change: GM_SUBMITTED -> GM_PROBE_JOBMANAGER
6/24 10:25:17 [22515] GAHP <- 'GRAM_JOB_STATUS 26 https://mercury.uvic.ca:40053/9165/1088097296/'
6/24 10:25:17 [22515] GAHP -> 'S'
6/24 10:25:17 [22515] GAHP <- 'RESULTS'
6/24 10:25:17 [22515] GAHP -> 'S' '1'
6/24 10:25:17 [22515] GAHP -> '26' '0' '0' '128'
6/24 10:25:17 [22515] (859.0) doEvaluateState called: gmState GM_PROBE_JOBMANAGER, globusState 128
6/24 10:25:17 [22515] (859.0) gm state change: GM_PROBE_JOBMANAGER -> GM_SUBMITTED
6/24 10:25:37 [22515] GAHP <- 'RESULTS'
6/24 10:25:37 [22515] GAHP -> 'S' '0'
6/24 10:25:38 [22515] checkResources(): 1 resources, 0 are down
6/24 10:26:37 [22515] GAHP <- 'RESULTS'
6/24 10:26:37 [22515] GAHP -> 'S' '0'
6/24 10:26:38 [22515] checkResources(): 1 resources, 0 are down
6/24 10:26:52 [22515] DaemonCore::IsPidAlive(): kill returned EPERM, assuming pid 3366 is alive.
6/24 10:27:37 [22515] GAHP <- 'RESULTS'
6/24 10:27:37 [22515] GAHP -> 'S' '0'
6/24 10:27:38 [22515] checkResources(): 1 resources, 0 are down
6/24 10:28:37 [22515] GAHP <- 'RESULTS'
6/24 10:28:37 [22515] GAHP -> 'S' '0'
6/24 10:28:38 [22515] checkResources(): 1 resources, 0 are down
6/24 10:28:52 [22515] DaemonCore::IsPidAlive(): kill returned EPERM, assuming pid 3366 is alive.



If I let it keep going, similar messages will just be repeated.


However, in other cases a job restart is attempted (this is the one I have noticed many times before, where netstat shows connections on both sides and a new log file on remote resource is created every minute). This is the corresponding tail end of the GridmanagerLog file:

6/24 08:39:37 [18272] GAHP <- 'RESULTS'
6/24 08:39:37 [18272] GAHP -> 'S' '0'
6/24 08:39:38 [18272] checkResources(): 1 resources, 0 are down
6/24 08:40:20 [18272] (814.0) doEvaluateState called: gmState GM_SUBMITTED, globusState 128
6/24 08:40:20 [18272] (814.0) gm state change: GM_SUBMITTED -> GM_PROBE_JOBMANAGER
6/24 08:40:20 [18272] GAHP <- 'GRAM_JOB_STATUS 59 https://mercury.uvic.ca:40023/30935/1088091305/'
6/24 08:40:20 [18272] GAHP -> 'S'
6/24 08:40:20 [18272] GAHP <- 'RESULTS'
6/24 08:40:20 [18272] GAHP -> 'S' '1'
6/24 08:40:20 [18272] GAHP -> '59' '0' '0' '128'
6/24 08:40:20 [18272] (814.0) doEvaluateState called: gmState GM_PROBE_JOBMANAGER, globusState 128
6/24 08:40:20 [18272] (814.0) gm state change: GM_PROBE_JOBMANAGER -> GM_SUBMITTED
6/24 08:40:37 [18272] GAHP <- 'RESULTS'
6/24 08:40:37 [18272] GAHP -> 'S' '0'
6/24 08:40:38 [18272] checkResources(): 1 resources, 0 are down
6/24 08:40:52 [18272] DaemonCore::IsPidAlive(): kill returned EPERM, assuming pid 3366 is alive.
6/24 08:41:37 [18272] GAHP <- 'RESULTS'
6/24 08:41:37 [18272] GAHP -> 'S' '0'
6/24 08:41:38 [18272] checkResources(): 1 resources, 0 are down
6/24 08:42:37 [18272] GAHP <- 'RESULTS'
6/24 08:42:37 [18272] GAHP -> 'S' '0'
6/24 08:42:38 [18272] checkResources(): 1 resources, 0 are down
6/24 08:42:52 [18272] DaemonCore::IsPidAlive(): kill returned EPERM, assuming pid 3366 is alive.
6/24 08:43:37 [18272] GAHP <- 'RESULTS'
6/24 08:43:37 [18272] GAHP -> 'S' '0'
6/24 08:43:38 [18272] checkResources(): 1 resources, 0 are down
6/24 08:44:37 [18272] GAHP <- 'RESULTS'
6/24 08:44:37 [18272] GAHP -> 'S' '0'
6/24 08:44:38 [18272] checkResources(): 1 resources, 0 are down
6/24 08:44:40 [18272] CheckProxies called
6/24 08:44:40 [18272] (re)caching proxy 1
6/24 08:44:40 [18272] GAHP <- 'CACHE_PROXY_FROM_FILE 1 /tmp/x509up_u505'
6/24 08:44:40 [18272] GAHP -> 'S'
6/24 08:44:40 [18272] GAHP <- 'USE_CACHED_PROXY 1'
6/24 08:44:40 [18272] GAHP -> 'S'
6/24 08:44:40 [18272] will call CheckProxies again in 600 seconds
6/24 08:44:40 [18272] (814.0) doEvaluateState called: gmState GM_SUBMITTED, globusState 128
6/24 08:44:40 [18272] (814.0) gm state change: GM_SUBMITTED -> GM_REFRESH_PROXY
6/24 08:44:40 [18272] GAHP <- 'GRAM_JOB_REFRESH_PROXY 60 https://mercury.uvic.ca:40023/30935/1088091305/'
6/24 08:44:40 [18272] GAHP -> 'S'
6/24 08:44:40 [18272] GAHP <- 'RESULTS'
6/24 08:44:40 [18272] GAHP -> 'S' '1'
6/24 08:44:40 [18272] GAHP -> '60' '10'
6/24 08:44:40 [18272] (814.0) doEvaluateState called: gmState GM_REFRESH_PROXY, globusState 128
6/24 08:44:40 [18272] (814.0) gmState GM_REFRESH_PROXY, globusState 128: refresh_credentials() returned Globus error 10
6/24 08:44:40 [18272] (814.0) gm state change: GM_REFRESH_PROXY -> GM_STOP_AND_RESTART
6/24 08:44:40 [18272] GAHP <- 'GRAM_JOB_SIGNAL 61 https://mercury.uvic.ca:40023/30935/1088091305/ 9 NULL'
6/24 08:44:40 [18272] GAHP -> 'S'
6/24 08:44:40 [18272] GAHP <- 'RESULTS'
6/24 08:44:40 [18272] GAHP -> 'S' '1'
6/24 08:44:40 [18272] GAHP -> '61' '0' '0' '128'
6/24 08:44:40 [18272] (814.0) doEvaluateState called: gmState GM_STOP_AND_RESTART, globusState 128
6/24 08:44:40 [18272] (814.0) gm state change: GM_STOP_AND_RESTART -> GM_RESTART
6/24 08:44:40 [18272] GAHP <- 'GRAM_JOB_REQUEST 62 mercury.uvic.ca:2119/jobmanager-pbs https://gcgate01.phys.uvic.ca:43460/ 1 &(rsl_substitution=(GRIDMANAGER_GASS_URL\ https://gcgate01.phys.uvic.ca:43461))(restart=https://mercury.uvic.ca:40023/30935/1088091305/)(remote_io_url=$(GRIDMANAGER_GASS_URL))(proxy_timeout=240)'
6/24 08:44:40 [18272] GAHP -> 'S'
6/24 08:44:41 [18272] GAHP <- 'RESULTS'
6/24 08:44:41 [18272] GAHP -> 'S' '1'
6/24 08:44:41 [18272] GAHP -> '62' '124' 'https://mercury.uvic.ca:40023/30935/1088091305/'
6/24 08:44:41 [18272] (814.0) doEvaluateState called: gmState GM_RESTART, globusState 128
6/24 08:44:41 [18272] (814.0) gm state change: GM_RESTART -> GM_INIT
6/24 08:44:41 [18272] (814.0) gm state change: GM_INIT -> GM_REGISTER
6/24 08:44:41 [18272] GAHP <- 'GRAM_JOB_CALLBACK_REGISTER 63 https://mercury.uvic.ca:40023/30935/1088091305/ https://gcgate01.phys.uvic.ca:43460/'
6/24 08:44:41 [18272] GAHP -> 'S'
6/24 08:44:41 [18272] GAHP <- 'RESULTS'
6/24 08:44:41 [18272] GAHP -> 'S' '1'
6/24 08:44:41 [18272] GAHP -> '63' '0' '130' '4'
6/24 08:44:41 [18272] (814.0) doEvaluateState called: gmState GM_REGISTER, globusState 128
6/24 08:44:41 [18272] (814.0) globus state change: STAGE_OUT -> FAILED
6/24 08:44:41 [18272] (814.0) gm state change: GM_REGISTER -> GM_STDIO_UPDATE
6/24 08:44:41 [18272] GAHP <- 'GRAM_JOB_SIGNAL 64 https://mercury.uvic.ca:40023/30935/1088091305/ 7 &(remote_io_url=https://gcgate01.phys.uvic.ca:43461)(invalid=bad)(invalid=bad)'
6/24 08:44:41 [18272] GAHP -> 'S'
6/24 08:44:41 [18272] (814.0) doEvaluateState called: gmState GM_STDIO_UPDATE, globusState 4
6/24 08:44:41 [18272] in doContactSchedd()
6/24 08:44:41 [18272] GRIDMANAGER_TIMEOUT_MULTIPLIER is undefined, using default value of 0
6/24 08:44:41 [18272] SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
6/24 08:44:41 [18272] AUTHENTICATE_FS: used file /tmp/qmgr_QPq9At, status: 1
6/24 08:44:41 [18272] Updating classad values for 814.0:
6/24 08:44:41 [18272] leaving doContactSchedd()
6/24 08:44:41 [18272] GAHP <- 'RESULTS'
6/24 08:44:41 [18272] GAHP -> 'S' '1'
6/24 08:44:41 [18272] GAHP -> '64' '94' '0' '4'
6/24 08:44:41 [18272] (814.0) doEvaluateState called: gmState GM_STDIO_UPDATE, globusState 4
6/24 08:44:41 [18272] (814.0) gm state change: GM_STDIO_UPDATE -> GM_SUBMITTED
6/24 08:44:41 [18272] (814.0) gm state change: GM_SUBMITTED -> GM_FAILED
6/24 08:44:41 [18272] (814.0) gm state change: GM_FAILED -> GM_RESTART
6/24 08:44:52 [18272] DaemonCore::IsPidAlive(): kill returned EPERM, assuming pid 3366 is alive.
6/24 08:45:37 [18272] GAHP <- 'RESULTS'
6/24 08:45:37 [18272] GAHP -> 'S' '0'
6/24 08:45:38 [18272] checkResources(): 1 resources, 0 are down
6/24 08:45:41 [18272] (814.0) doEvaluateState called: gmState GM_RESTART, globusState 4
6/24 08:45:41 [18272] GAHP <- 'GRAM_JOB_REQUEST 65 mercury.uvic.ca:2119/jobmanager-pbs https://gcgate01.phys.uvic.ca:43460/ 1 &(rsl_substitution=(GRIDMANAGER_GASS_URL\ https://gcgate01.phys.uvic.ca:43461))(restart=https://mercury.uvic.ca:40023/30935/1088091305/)(remote_io_url=$(GRIDMANAGER_GASS_URL))(proxy_timeout=240)'
6/24 08:45:41 [18272] GAHP -> 'S'
6/24 08:45:41 [18272] GAHP <- 'RESULTS'
6/24 08:45:41 [18272] GAHP -> 'S' '1'
6/24 08:45:41 [18272] GAHP -> '65' '124' 'https://mercury.uvic.ca:40023/30935/1088091305/'
6/24 08:45:41 [18272] (814.0) doEvaluateState called: gmState GM_RESTART, globusState 4
6/24 08:45:41 [18272] (814.0) gm state change: GM_RESTART -> GM_INIT
6/24 08:45:41 [18272] (814.0) gm state change: GM_INIT -> GM_REGISTER
6/24 08:45:41 [18272] GAHP <- 'GRAM_JOB_CALLBACK_REGISTER 66 https://mercury.uvic.ca:40023/30935/1088091305/ https://gcgate01.phys.uvic.ca:43460/'
6/24 08:45:41 [18272] GAHP -> 'S'
6/24 08:45:41 [18272] GAHP <- 'RESULTS'
6/24 08:45:41 [18272] GAHP -> 'S' '1'
6/24 08:45:41 [18272] GAHP -> '66' '0' '130' '4'
6/24 08:45:41 [18272] (814.0) doEvaluateState called: gmState GM_REGISTER, globusState 4
6/24 08:45:41 [18272] (814.0) gm state change: GM_REGISTER -> GM_STDIO_UPDATE
6/24 08:45:41 [18272] GAHP <- 'GRAM_JOB_SIGNAL 67 https://mercury.uvic.ca:40023/30935/1088091305/ 7 &(remote_io_url=https://gcgate01.phys.uvic.ca:43461)(invalid=bad)(invalid=bad)'
6/24 08:45:41 [18272] GAHP -> 'S'
6/24 08:45:42 [18272] GAHP <- 'RESULTS'
6/24 08:45:42 [18272] GAHP -> 'S' '1'
6/24 08:45:42 [18272] GAHP -> '67' '94' '0' '4'
6/24 08:45:42 [18272] (814.0) doEvaluateState called: gmState GM_STDIO_UPDATE, globusState 4
6/24 08:45:42 [18272] (814.0) gm state change: GM_STDIO_UPDATE -> GM_SUBMITTED
6/24 08:45:42 [18272] (814.0) gm state change: GM_SUBMITTED -> GM_FAILED
6/24 08:45:42 [18272] (814.0) gm state change: GM_FAILED -> GM_RESTART


I would have sent whole log files, but the problem only appears when multiple jobs are run at once, so log files get quite big and they exceeded the maximum size allowed for posting to this list. Let me know if it would help to have them and I'll send them offline.

-Lila Klektau