
[Condor-users] condor-g jobs failing - stuck in STAGE_OUT



Hi,

We're using Condor-G to submit to three different grid resources. Jobs submit and run fine on two of them, but on the third a significant number of jobs get stuck in the STAGE_OUT status. All three resources run PBS, and for the stuck jobs PBS reports that they finished executing successfully. I suspect the problem is triggered when two or more jobs finish at the same time: when I schedule test jobs so that they end at different times, I don't see the problem.
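If it helps to reproduce it, a test submission along these lines makes several jobs finish at essentially the same time, which is when we see the problem (the resource contact and run time are just placeholders for our setup):
   universe        = globus
   # placeholder contact for the problem site
   globusscheduler = mercury.uvic.ca/jobmanager-pbs
   executable      = /bin/sleep
   # identical run times so the jobs finish together
   arguments       = 300
   output          = sleep.$(Cluster).$(Process).out
   error           = sleep.$(Cluster).$(Process).err
   log             = sleep.$(Cluster).$(Process).log
   queue 4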

We've tried running jobs against the problem site from different instances of Condor-G; some work fine and others don't (6.6.1 and one 6.6.3 instance work, but a different 6.6.3 instance and 6.6.5 show the problem). All sites are using Globus 2.4.3.

In all the condor_config files, we have:
   GRIDMANAGER_MAX_PENDING_SUBMITS_PER_RESOURCE = 1
   GRID_MONITOR = $(SBIN)/grid_monitor.sh
   ENABLE_GRID_MONITOR = True
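
For completeness, the values the running daemons actually see can be checked on each submit host with condor_config_val:
   condor_config_val GRIDMANAGER_MAX_PENDING_SUBMITS_PER_RESOURCE
   condor_config_val GRID_MONITOR
   condor_config_val ENABLE_GRID_MONITOR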

The GridmanagerLog file shows this (I believe globusState 128 is GRAM's STAGE_OUT state):
6/21 09:59:23 [11400] (694.0) doEvaluateState called: gmState GM_SUBMITTED, globusState 128
6/21 09:59:23 [11400] (694.0) doEvaluateState called: gmState GM_PROBE_JOBMANAGER, globusState 128
6/21 10:03:06 [11400] (694.0) doEvaluateState called: gmState GM_SUBMITTED, globusState 128
6/21 10:03:06 [11400] (694.0) doEvaluateState called: gmState GM_REFRESH_PROXY, globusState 128
6/21 10:03:06 [11400] (694.0) gmState GM_REFRESH_PROXY, globusState 128: refresh_credentials() returned Globus error 10
6/21 10:03:06 [11400] (694.0) doEvaluateState called: gmState GM_STOP_AND_RESTART, globusState 128
6/21 10:03:07 [11400] (694.0) doEvaluateState called: gmState GM_RESTART, globusState 128
6/21 10:03:07 [11400] (694.0) doEvaluateState called: gmState GM_REGISTER, globusState 128
6/21 10:03:07 [11400] (694.0) doEvaluateState called: gmState GM_STDIO_UPDATE, globusState 4
6/21 10:03:07 [11400] (694.0) doEvaluateState called: gmState GM_STDIO_UPDATE, globusState 4
and then the following lines repeated every minute:
6/21 10:04:07 [11400] (694.0) doEvaluateState called: gmState GM_RESTART, globusState 4
6/21 10:04:07 [11400] (694.0) doEvaluateState called: gmState GM_RESTART, globusState 4
6/21 10:04:07 [11400] (694.0) doEvaluateState called: gmState GM_REGISTER, globusState 4
6/21 10:04:07 [11400] (694.0) doEvaluateState called: gmState GM_STDIO_UPDATE, globusState 4
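
Since the loop starts right after refresh_credentials() fails, the proxy itself might be relevant; it can be inspected on the submit host with grid-proxy-info (the -file form is only needed if the proxy isn't in the default location, and the path shown is just a placeholder):
   grid-proxy-info -subject -timeleft
   # or, for an explicitly named proxy file:
   grid-proxy-info -file /tmp/x509up_u<uid> -subject -timeleft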


Here's the error from the original gram_job_mgr log file on the remote resource:
6/21 09:59:23 globus_gram_job_manager_query_callback() not a literal URI match
6/21 09:59:23 JM : in globus_l_gram_job_manager_query_callback, query=status
6/21 09:59:23 JM : reply: (status=128 failure code=0 (Success))
6/21 09:59:23 JM : sending reply:
protocol-version: 2
status: 128
failure-code: 0
job-failure-code: 0
6/21 09:59:23 -------------------
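
The same status can also be queried by hand with globusrun (the contact string below is only a placeholder for the real job contact returned at submit time):
   globusrun -status https://mercury.uvic.ca:40033/<pid>/<timestamp>/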


Each time those four lines in the GridmanagerLog are repeated, a new gram_job_mgr log file is created on the remote resource; it tries to restart the job but fails with the following error:
6/21 10:03:06 JM: State lock file is locked, old jm is still alive
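
To sanity-check the "old jm is still alive" message, we can look at the original job manager process directly (pid 29693 comes from the ps listing below; lsof may not be installed on every node):
   # is the original job manager still running?
   ps -fp 29693
   # if so, which GRAM state files does it still hold open (paths vary by installation)?
   lsof -p 29693 | grep -i gram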




Processes still running on the remote resource are:
gcprod05 29693 0.0 0.1 5288 3336 ? S 09:54 0:00 globus-job-manager -conf /usr/pkg/src/globus-toolkit-2.4.3/etc/globus-job-manager.conf -type pbs -rdn jobmanager-pbs -machine-type unknown -publish-jobs
gcprod05 31268 0.0 0.1 5196 3776 ? S 09:55 0:00 /usr/bin/perl /usr/pkg/src/globus-toolkit-2.4.3/libexec/globus-job-manager-script.pl -m pbs -f /tmp/gram_d6KDpT -c stage_out
gcprod05 31320 0.0 0.1 10664 2996 ? S 09:55 0:00 /usr/pkg/src/globus-toolkit-2.4.3/bin/globus-url-copy file:///home1x/gcprod/gcprod05/.globus/.gass_cache/local/md5/fe/08fb57ce42a6cf460df356f86d3217/md5/84/71edf9ea00d6dbf74df5dc6b303e15/data https://gcgate01.phys.uvic.ca:34812/home/gcprod05/.globus/.gass_cache/local/md5/df/f9f6c77e8acb7d53888f7bb22612d3/md5/e2/ac0fd33d302908e03f54adf25bbda7/data
gcprod05 31321 0.0 0.1 10664 2996 ? S 09:55 0:00 /usr/pkg/src/globus-toolkit-2.4.3/bin/globus-url-copy file:///home1x/gcprod/gcprod05/.globus/.gass_cache/local/md5/fe/08fb57ce42a6cf460df356f86d3217/md5/84/71edf9ea00d6dbf74df5dc6b303e15/data https://gcgate01.phys.uvic.ca:34812/home/gcprod05/.globus/.gass_cache/local/md5/df/f9f6c77e8acb7d53888f7bb22612d3/md5/e2/ac0fd33d302908e03f54adf25bbda7/data
gcprod05 31322 0.0 0.1 10664 2996 ? S 09:55 0:00 /usr/pkg/src/globus-toolkit-2.4.3/bin/globus-url-copy file:///home1x/gcprod/gcprod05/.globus/.gass_cache/local/md5/fe/08fb57ce42a6cf460df356f86d3217/md5/84/71edf9ea00d6dbf74df5dc6b303e15/data https://gcgate01.phys.uvic.ca:34812/home/gcprod05/.globus/.gass_cache/local/md5/df/f9f6c77e8acb7d53888f7bb22612d3/md5/e2/ac0fd33d302908e03f54adf25bbda7/data
gcprod05 31323 0.0 0.1 10664 2996 ? S 09:55 0:00 /usr/pkg/src/globus-toolkit-2.4.3/bin/globus-url-copy file:///home1x/gcprod/gcprod05/.globus/.gass_cache/local/md5/fe/08fb57ce42a6cf460df356f86d3217/md5/84/71edf9ea00d6dbf74df5dc6b303e15/data https://gcgate01.phys.uvic.ca:34812/home/gcprod05/.globus/.gass_cache/local/md5/df/f9f6c77e8acb7d53888f7bb22612d3/md5/e2/ac0fd33d302908e03f54adf25bbda7/data
gcprod05 31329 0.0 0.1 10664 2996 ? S 09:55 0:00 /usr/pkg/src/globus-toolkit-2.4.3/bin/globus-url-copy file:///home1x/gcprod/gcprod05/.globus/.gass_cache/local/md5/fe/08fb57ce42a6cf460df356f86d3217/md5/84/71edf9ea00d6dbf74df5dc6b303e15/data https://gcgate01.phys.uvic.ca:34812/home/gcprod05/.globus/.gass_cache/local/md5/df/f9f6c77e8acb7d53888f7bb22612d3/md5/e2/ac0fd33d302908e03f54adf25bbda7/data


Processes still running on the condor-g resource are:
gcprod05 11400 3366 0 09:53 ? 00:00:00 condor_gridmanager -f -C (Owner=?="gcprod05"&&x509userproxysubject=?="/C=CA/O=Grid/OU=phys.uvic.ca/CN=Lila_Klektau/CN=proxy/CN=proxy/CN=proxy") -S /tmp/condor_g_scratch.0x83b9890.3366
gcprod05 11401 11400 0 09:53 ? 00:00:04 /opt/condor-6.6.5/sbin/gahp_server
gcprod05 11858 1 0 09:53 ? 00:00:00 globus-job-manager -conf /home/globus/globus-2.4.3//etc/globus-job-manager.conf -type condorg -rdn jobmanager-condorg -machine-type unknown -publish-jobs


netstat on the remote resource shows this:
tcp 0 0 mercury.uvic.ca:40033 gcgate01.phys.UVi:35280 TIME_WAIT
tcp 0 0 mercury.uvic.ca:40033 gcgate01.phys.UVi:35279 TIME_WAIT
tcp 1 0 mercury.u:gsigatekeeper gcgate01.phys.UVi:35078 CLOSE_WAIT


netstat on the Condor-G resource shows this:
tcp 0 0 gcgate01.phys.UVi:35275 mercury.u:gsigatekeeper TIME_WAIT


When jobs do get stuck, the only way we've found to fix things is to ssh to the remote resource and kill the leftover processes by hand, then manually remove the jobs from Condor and clean up the log files. We didn't notice jobs hanging with Globus until Condor-G was introduced into the submission process.
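
For the record, the manual cleanup looks roughly like this (the pids are from the ps listing above and the job id from the GridmanagerLog; adjust to the stuck job in question):
   # on the remote resource: kill the leftover job manager and transfers
   kill 29693 31268 31320 31321 31322 31323 31329
   # on the submit host: remove the stuck job from the queue
   condor_rm 694.0
   # if it lingers in the X state, force it out (if this condor_rm supports -forcex)
   condor_rm -forcex 694.0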

Has this problem been encountered before? Do you know if there are any patches available for it?

Thanks for any help

-Lila