Re: [Condor-users] condor-g jobs failing - stuck in STAGE_OUT



On Thu, 24 Jun 2004, Lila Klektau wrote:

> On Wed, 23 Jun 2004 21:53:51 -0500, Jaime Frey <jfrey@xxxxxxxxxxx> wrote:
>
> > It looks like the globus-url-copy to stage out the job files is hanging
> > on the gatekeeper machine. We've seen this problem with globus-url-copy
> > in other situations, but haven't been able to determine the cause. If
> > you could add/change the following line to your condor_config file,
> > reproduce the problem, and send me the resulting gridmanager log file,
> > it'd be a great help in figuring out how to make condor-g better at
> > working around this problem:
> >
> > GRIDMANAGER_DEBUG = D_FULLDEBUG
>
> Thanks for the reply,
>
> I've been trying to recreate the problem, with full debugging, but I'm
> noticing two different outputs.  In one case, the job is not attempting to
> restart and netstat shows no connections on the condor-g machine to the
> remote resource, but the remote resource shows CLOSE_WAIT connections with
> the condor-g machine.  This is the tail end of the GridmanagerLog file for
> that:
>
...<log file>
>
> If I let it keep going, similar messages will just be repeated.

In this case, the globus jobmanager is replying properly to the condor
gridmanager's periodic probes, so the gridmanager patiently waits for the
jobmanager to say it's done staging out the files.

> However, in other cases a job restart is attempted (this is the one I have
> noticed many times before, where netstat shows connections on both sides
> and a new log file on remote resource is created every minute).  This is
> the corresponding tail end of the GridmanagerLog file:
>
...<log file>

Here, the x509 proxy has been refreshed, so the gridmanager tries to
forward the new proxy to the jobmanager. The jobmanager replies with gram
error 10 (PROTOCOL_FAILED), so the gridmanager tries to stop and restart
it, hoping that'll clear up whatever's wrong. The jobmanager acknowledges
the stop request, but won't actually stop until globus-url-copy completes.
The gridmanager then tries to restart the jobmanager, which fails because
the old one hasn't exited yet. The gridmanager ends up in a loop: it talks
to the old jobmanager, which says it's just about to quit, then tries to
start a new jobmanager, which fails because the old one hasn't actually
quit yet.
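
To make the loop concrete, here is a rough toy model of it in Python. This
is only a sketch of the behavior described above, not Condor-G or Globus
code: the ToyJobmanager class, its methods, the "ticks" that stand in for
the time globus-url-copy takes, and the restart_attempts helper are all
made up for illustration.

```python
# Toy model of the stop/restart loop. Every name here is hypothetical;
# none of this is real gridmanager or jobmanager code.

class ToyJobmanager:
    def __init__(self, stage_out_ticks):
        # Ticks left until the stage-out (globus-url-copy) finishes.
        # A very large value models a hung transfer.
        self.stage_out_ticks = stage_out_ticks
        self.stop_requested = False

    def tick(self):
        # One unit of time passes; the transfer makes progress if it can.
        if self.stage_out_ticks > 0:
            self.stage_out_ticks -= 1

    def request_stop(self):
        # The stop request is acknowledged right away...
        self.stop_requested = True
        return "stopping"

    def is_running(self):
        # ...but the process only exits once the stage-out has completed.
        return not (self.stop_requested and self.stage_out_ticks == 0)


def restart_attempts(jobmanager, max_attempts):
    """Count failed restart attempts before the old jobmanager is gone.

    Returns max_attempts if the old jobmanager never exits in that time.
    """
    failed = 0
    for _ in range(max_attempts):
        jobmanager.request_stop()
        if not jobmanager.is_running():
            return failed       # old one has exited, restart can proceed
        failed += 1             # restart fails: old one is still alive
        jobmanager.tick()
    return failed
```

With a transfer that finishes after a few ticks, the restart eventually
goes through. With a hung transfer, every attempt fails, which is the
once-a-minute pattern you're seeing.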

> I would have sent whole log files, but the problem only appears when
> multiple jobs are run at once, so log files get quite big and they
> exceeded the maximum size allowed for posting to this list.  Let me know
> if it would help to have them and I'll send them offline.

The excerpts you've sent are enough to diagnose the problem. The
additional piece that would be useful to see is the logfile for the
jobmanager process that won't die. I think I know why the proxy refresh
command fails, but the log will hopefully confirm my suspicion.

+----------------------------------+---------------------------------+
|            Jaime Frey            | I stayed up all night playing   |
|        jfrey@xxxxxxxxxxx         | poker with tarot cards. I got a |
|  http://www.cs.wisc.edu/~jfrey/  | full house and four people died.|
+----------------------------------+---------------------------------+