
Re: [Condor-users] condor-g jobs failing - stuck in STAGE_OUT

On Thu, 24 Jun 2004 15:49:53 -0500 (CDT), Jaime Frey <jfrey@xxxxxxxxxxx> wrote:

On Thu, 24 Jun 2004, Lila Klektau wrote:

On Wed, 23 Jun 2004 21:53:51 -0500, Jaime Frey <jfrey@xxxxxxxxxxx> wrote:

> It looks like the globus-url-copy to stage out the job files is hanging
> on the gatekeeper machine. We've seen this problem with globus-url-copy
> in other situations, but haven't been able to determine the cause. If
> you could add/change the following line to your condor_config file,
> reproduce the problem, and send me the resulting gridmanager log file,
> it'd be a great help in figuring out how to make condor-g better at
> working around this problem:

Thanks for the reply,

I've been trying to recreate the problem, with full debugging, but I'm
noticing two different outputs. In one case, the job is not attempting to
restart and netstat shows no connections on the condor-g machine to the
remote resource, but the remote resource shows CLOSE_WAIT connections with
the condor-g machine. This is the tail end of the GridmanagerLog file for
this case:

...<log file>

If I let it keep going, similar messages will just be repeated.

In this case, the globus jobmanager is replying properly to the condor gridmanager's periodic probes, so the gridmanager patiently waits for the jobmanager to say it's done staging out the files.
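
A CLOSE_WAIT socket means the peer has closed its end while the local process still holds the socket open, which would be consistent with a hung transfer process on the remote resource. As a rough aid for watching this while reproducing the problem, here is a minimal Python sketch that tallies TCP states per peer from `netstat -tn` style output. It assumes the usual Linux column layout; the function name `count_states` and the sample addresses are made up for illustration and do not come from the logs in this thread.

```python
from collections import Counter


def count_states(netstat_text, peer):
    """Count TCP states for connections whose foreign address is on `peer`."""
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Assumed Linux `netstat -tn` row layout:
        # proto, recv-q, send-q, local address, foreign address, state
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue  # skips the header row and non-TCP lines
        foreign, state = fields[4], fields[5]
        host = foreign.rsplit(":", 1)[0]  # drop the port
        if host == peer:
            counts[state] += 1
    return counts


# Illustrative sample only (documentation addresses, not real hosts).
SAMPLE = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        1      0 192.0.2.10:40001        192.0.2.20:9620         CLOSE_WAIT
tcp        0      0 192.0.2.10:40002        192.0.2.20:9621         CLOSE_WAIT
tcp        0      0 192.0.2.10:22           192.0.2.99:51515        ESTABLISHED
"""

if __name__ == "__main__":
    print(count_states(SAMPLE, "192.0.2.20"))
```

Run on the remote resource with the condor-g machine as `peer`, a CLOSE_WAIT count that never drops would suggest the stuck process is not closing its sockets.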

However, in other cases a job restart is attempted (this is the one I have
noticed many times before, where netstat shows connections on both sides
and a new log file on remote resource is created every minute). This is
the corresponding tail end of the GridmanagerLog file:

...<log file>

Here, the x509 proxy has been refreshed, so the gridmanager tries to
forward the new proxy to the jobmanager. The jobmanager replies with gram
error 10 (PROTOCOL_FAILED), so the gridmanager tries to stop and restart
it, hoping that'll clear up whatever's wrong. The jobmanager acknowledges
the stop request, but won't actually stop until globus-url-copy completes.
The gridmanager tries to restart the jobmanager, which fails as the old
one hasn't exited yet. The gridmanager ends up in a loop, trying to talk
to the old jobmanager, which says it's just about to quit, and trying to
start a new jobmanager, which fails because the old one hasn't actually quit.
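
The stop/restart loop described above can be illustrated with a toy model. This is a sketch only, not Condor-G or Globus code: `ToyJobManager`, `restart_loop`, and the tick-based timing are hypothetical names and simplifications. The one behavior it models is that the old jobmanager acknowledges a stop request but does not exit until its transfer finishes, so a restart can never succeed while the transfer hangs.

```python
class ToyJobManager:
    """Hypothetical stand-in for a jobmanager that is staging out files."""

    def __init__(self, transfer_ticks_left):
        # None models a transfer that hangs forever.
        self.transfer_ticks_left = transfer_ticks_left
        self.stop_requested = False

    def tick(self):
        """Advance the transfer by one step, if it is progressing at all."""
        if self.transfer_ticks_left:
            self.transfer_ticks_left -= 1

    def request_stop(self):
        """Acknowledge the stop request without exiting immediately."""
        self.stop_requested = True
        return "ack"

    def has_exited(self):
        """Exit only after a stop was requested and the transfer finished."""
        return self.stop_requested and self.transfer_ticks_left == 0


def restart_loop(old, max_attempts):
    """Return the attempt on which a restart would succeed, or None."""
    old.request_stop()
    for attempt in range(1, max_attempts + 1):
        if old.has_exited():
            return attempt  # old jobmanager is gone; a new one can start
        old.tick()  # restart fails; time passes before the next attempt
    return None  # bounded here; the loop described above has no such bound


if __name__ == "__main__":
    print(restart_loop(ToyJobManager(3), max_attempts=10))
    print(restart_loop(ToyJobManager(None), max_attempts=10))
```

With a transfer that eventually finishes, the restart succeeds a few attempts later; with a hung transfer it never does, which matches the repeating log messages.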

I would have sent whole log files, but the problem only appears when
multiple jobs are run at once, so log files get quite big and they
exceeded the maximum size allowed for posting to this list.  Let me know
if it would help to have them and I'll send them offline.

The excerpts you've sent are enough to diagnose the problem. The additional piece that would be useful to see is the log file for the jobmanager process that won't die. I think I know why the proxy refresh command fails, but the log will hopefully confirm my suspicion.

I've attached the jobmanager log file.

Even if that confirms your proxy suspicions, do you have any idea why the transfer would be hanging in the first place?

-Lila Klektau

Attachment: gram_job_mgr_26692.log
Description: Binary data