Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

Re: [Condor-users] My dag is frozen

Date: Fri, 18 Jul 2008 08:32:17 -0500 (CDT)
From: "R. Kent Wenger" <wenger@xxxxxxxxxxx>
Subject: Re: [Condor-users] My dag is frozen

On Fri, 18 Jul 2008, Lucia Santamaria wrote:

for the last 4 days my dagman in morgane seems frozen and doesn't trigger
more jobs. The dag is not yet finished, as you can see if you execute
...
and also, _no_ rescue dag is created. If I look at one of the
dag.dagman.out corresponding to one of the subdags that are not yet
finished (for instance, cat2 dags in nsbhinj):
...
7/18 14:15:26 319886 seconds since last log event
7/18 14:15:26 Pending DAG nodes:
7/18 14:15:26   Node 20abc05cccfa0bf1b7e41fa441b90524, Condor ID 169496,
status STATUS_SUBMITTED
7/18 14:25:26 320486 seconds since last log event
...

I've seen something like that once before. At that time, the cause was that the file descriptor for a node job's user log file somehow became disconnected from the actual file, without creating any errors when it was read -- it just never reported any more bytes available (but poking around in /proc/*/fd revealed some problems). That might be what's happening now. (BTW, are your user log files on a local filesystem? I vaguely remember that in the previous case the user log files may have been on a shared filesystem.)
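For anyone who wants to do the same kind of poking around in /proc/*/fd, here is a minimal Python sketch that lists what each open descriptor of a process points at. It assumes a Linux /proc filesystem. The function name, the proc_root parameter, and the "suspicious" heuristic (a target marked deleted or no longer present on disk) are illustrative assumptions; the message above does not say which problems actually showed up.

```python
import os


def describe_open_files(pid, proc_root="/proc"):
    """Return (fd, target, suspicious) tuples for a process's open descriptors.

    Assumption: a descriptor is flagged as suspicious when its target is
    marked " (deleted)" by the kernel or is an absolute path that no longer
    exists. This is only a heuristic, not a definitive diagnosis.
    """
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    results = []
    for name in sorted(os.listdir(fd_dir), key=int):
        link = os.path.join(fd_dir, name)
        try:
            target = os.readlink(link)
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
        # Sockets and pipes show up as e.g. "socket:[1234]" and are not
        # flagged, because their targets are not filesystem paths.
        missing = target.startswith("/") and not os.path.exists(target)
        suspicious = target.endswith(" (deleted)") or missing
        results.append((int(name), target, suspicious))
    return results
```

Run it as the user that owns the condor_dagman process (or as root), passing that process's PID, and compare the reported targets with the user log file paths named in the submit files.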

Anyhow, if you do a condor_hold and then a condor_release on the "stuck" condor_dagman(s), I think that will fix things. (Hopefully you are running the 7.1.1 pre-release DAGMan, which has the "fast recovery" fix.)
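The hold-then-release step can be scripted if several DAGs are stuck. The following is a rough Python sketch, not an official procedure: the helper names are made up for illustration, and the cluster ID you pass must be that of the condor_dagman job itself (as listed by condor_q), not the ID of a node job. Whether a pause between the two commands is needed is not stated above, so none is included.

```python
import subprocess


def hold_release_commands(cluster_id):
    """Build the two command lines for one stuck condor_dagman job."""
    job = str(cluster_id)
    return [["condor_hold", job], ["condor_release", job]]


def hold_and_release(cluster_id, run=subprocess.run):
    """Hold the DAGMan job, then release it so it restarts in recovery mode.

    `run` is injectable so the sequence can be dry-run or tested without
    a Condor installation; by default it executes the real commands.
    """
    for cmd in hold_release_commands(cluster_id):
        # check=True stops the sequence if condor_hold fails.
        run(cmd, check=True)
```

For example, hold_and_release(1234) would run condor_hold 1234 followed by condor_release 1234, where 1234 is a placeholder cluster ID.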


Kent Wenger
Condor Team
