Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] condor-g jobs failing - stuck in STAGE_OUT

Date: Thu, 24 Jun 2004 14:35:41 -0700
From: "Lila Klektau" <lmk@xxxxxxx>
Subject: Re: [Condor-users] condor-g jobs failing - stuck in STAGE_OUT

On Thu, 24 Jun 2004 15:49:53 -0500 (CDT), Jaime Frey <jfrey@xxxxxxxxxxx> wrote:

On Thu, 24 Jun 2004, Lila Klektau wrote:

On Wed, 23 Jun 2004 21:53:51 -0500, Jaime Frey <jfrey@xxxxxxxxxxx> wrote:

> It looks like the globus-url-copy to stage out the job files is hanging
> on the gatekeeper machine. We've seen this problem with globus-url-copy
> in other situations, but haven't been able to determine the cause. If
> you could add/change the following line to your condor_config file,
> reproduce the problem, and send me the resulting gridmanager log file,
> it'd be a great help in figuring out how to make condor-g better at
> working around this problem:
>
> GRIDMANAGER_DEBUG = D_FULLDEBUG

Thanks for the reply,

I've been trying to recreate the problem, with full debugging, but I'm noticing two different outputs. In one case, the job is not attempting to restart and netstat shows no connections on the condor-g machine to the remote resource, but the remote resource shows CLOSE_WAIT connections with the condor-g machine. This is the tail end of the GridmanagerLog file for that:

...<log file>
If I let it keep going, similar messages will just be repeated.
In this case, the globus jobmanager is replying properly to the condor
gridmanager's periodic probes, so the gridmanager patiently waits for the
jobmanager to say it's done staging out the files.
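
The CLOSE_WAIT check described for this first case can be scripted so it is easy to repeat while the job is stuck. What follows is only a minimal sketch: it assumes Linux-style `netstat -tn` output (six whitespace-separated columns, with the foreign address in the fifth and the state in the sixth), and the function name, IP addresses, and ports are made up for illustration.

```python
def count_close_wait(netstat_output: str, peer_ip: str) -> int:
    """Count TCP connections to peer_ip that are in CLOSE_WAIT.

    Assumes rows shaped like Linux `netstat -tn`:
    Proto Recv-Q Send-Q Local-Address Foreign-Address State
    """
    count = 0
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip banner and header lines, and anything that is not a TCP row.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        foreign, state = fields[4], fields[5]
        host = foreign.rsplit(":", 1)[0]  # drop the trailing :port
        if host == peer_ip and state == "CLOSE_WAIT":
            count += 1
    return count


# Hypothetical sample, as it might look on the remote resource.
SAMPLE = """\
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        1      0 192.0.2.10:40001        192.0.2.20:40100        CLOSE_WAIT
tcp        0      0 192.0.2.10:22           198.51.100.7:51000      ESTABLISHED
tcp        1      0 192.0.2.10:40002        192.0.2.20:40101        CLOSE_WAIT
"""

stuck = count_close_wait(SAMPLE, "192.0.2.20")
```

A nonzero count on the remote resource, together with no matching connections on the condor-g machine, would correspond to the first behaviour described above.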
However, in other cases a job restart is attempted (this is the one I have noticed many times before, where netstat shows connections on both sides and a new log file is created on the remote resource every minute). This is the corresponding tail end of the GridmanagerLog file:

...<log file>

Here, the x509 proxy has been refreshed, so the gridmanager tries to forward the new proxy to the jobmanager. The jobmanager replies with GRAM error 10 (PROTOCOL_FAILED), so the gridmanager tries to stop and restart it, hoping that'll clear up whatever's wrong. The jobmanager acknowledges the stop request, but won't actually stop until globus-url-copy completes. The gridmanager tries to restart the jobmanager, which fails because the old one hasn't exited yet. The gridmanager ends up in a loop: it talks to the old jobmanager, which says it's just about to quit, and it tries to start a new jobmanager, which fails because the old one hasn't actually quit yet.
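
The stop/restart loop described above can be pictured with a toy model. This is only an illustrative sketch of the described behaviour, not Condor-G or Globus code: the `JobManager` class, `try_restart`, `gridmanager_loop`, and the round counter are all hypothetical names invented here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class JobManager:
    """Toy stand-in for a jobmanager blocked on a stage-out transfer."""
    transfer_done: bool = False
    stop_requested: bool = False

    @property
    def alive(self) -> bool:
        # Exits only after a stop was requested AND the transfer finished.
        return not (self.stop_requested and self.transfer_done)

    def request_stop(self) -> str:
        self.stop_requested = True
        return "stop acknowledged"


def try_restart(old: JobManager) -> bool:
    """A new jobmanager can start only once the old one has exited."""
    return not old.alive


def gridmanager_loop(old: JobManager,
                     transfer_finishes_at: Optional[int],
                     max_rounds: int = 5) -> List[str]:
    """Request a stop, then retry the restart for up to max_rounds rounds."""
    events = [old.request_stop()]
    for round_no in range(1, max_rounds + 1):
        # Model the transfer completing at a given round, or never (None).
        if transfer_finishes_at is not None and round_no >= transfer_finishes_at:
            old.transfer_done = True
        if try_restart(old):
            events.append(f"round {round_no}: restart succeeded")
            return events
        events.append(f"round {round_no}: restart failed, old jobmanager still stopping")
    events.append("gave up: transfer never completed")
    return events


hung = gridmanager_loop(JobManager(), transfer_finishes_at=None, max_rounds=3)
recovered = gridmanager_loop(JobManager(), transfer_finishes_at=2)
```

In the model, the restart succeeds only once the transfer completes; with a transfer that never finishes, every round fails the same way, which mirrors the loop seen in the log.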
I would have sent whole log files, but the problem only appears when
multiple jobs are run at once, so log files get quite big and they
exceeded the maximum size allowed for posting to this list.  Let me know
if it would help to have them and I'll send them offline.
The excerpts you've sent are enough to diagnose the problem. The
additional piece that would be useful to see is the logfile for the
jobmanager process that won't die. I think I know why the proxy refresh
command fails, but the log will hopefully confirm my suspicion.

I've attached the jobmanager log file.

Even if that confirms your proxy suspicions, do you have any idea why the transfer would be hanging in the first place?

-Lila Klektau

Attachment: gram_job_mgr_26692.log
Description: Binary data

References:
- [Condor-users] condor-g jobs failing - stuck in STAGE_OUT
  - From: Lila Klektau
- Re: [Condor-users] condor-g jobs failing - stuck in STAGE_OUT
  - From: Jaime Frey
- Re: [Condor-users] condor-g jobs failing - stuck in STAGE_OUT
  - From: Lila Klektau
- Re: [Condor-users] condor-g jobs failing - stuck in STAGE_OUT
  - From: Jaime Frey

