
Re: [Condor-users] My dag is frozen



On Fri, 18 Jul 2008, Lucia Santamaria wrote:

for the last 4 days my dagman in morgane seems frozen and doesn't trigger
more jobs. The dag is not yet finished, as you can see if you execute
...
and also, _no_ rescue dag is created. If I look at one of the
dag.dagman.out corresponding to one of the subdags that are not yet
finished (for instance, cat2 dags in nsbhinj):
...
7/18 14:15:26 319886 seconds since last log event
7/18 14:15:26 Pending DAG nodes:
7/18 14:15:26   Node 20abc05cccfa0bf1b7e41fa441b90524, Condor ID 169496, status STATUS_SUBMITTED
7/18 14:25:26 320486 seconds since last log event
...

I've seen something like this once before. That time, the cause was that the file descriptor for a node job's user log file had somehow become disconnected from the actual file. Reading it produced no errors; it simply never reported any more bytes available (though poking around in /proc/*/fd revealed some problems). That might be what's happening now. (BTW, are your user log files on a local filesystem? I vaguely remember that in the previous case the user log files may have been on a shared filesystem.)
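For anyone who wants to do the same kind of poking around, here is a minimal Python sketch that lists what each open file descriptor of a process points to on Linux. The function names (inspect_fds, looks_unlinked, report) are made up for illustration. The " (deleted)" check is only one symptom a disconnected descriptor might show, and it is an assumption here; the problems seen in the earlier case are not described above and may have looked different.

```python
import os

def inspect_fds(pid):
    """Return (fd, target) pairs for the open descriptors of `pid` (Linux /proc only)."""
    fd_dir = f"/proc/{pid}/fd"
    results = []
    for name in sorted(os.listdir(fd_dir), key=int):
        try:
            # Each entry is a symlink to whatever the descriptor refers to.
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed between listdir() and readlink()
        results.append((int(name), target))
    return results

def looks_unlinked(target):
    """Assumed symptom: Linux appends ' (deleted)' when the underlying file was unlinked."""
    return target.endswith(" (deleted)")

def report(pid):
    """Format one line per descriptor, flagging the ones that look unlinked."""
    lines = []
    for fd, target in inspect_fds(pid):
        flag = "  <-- unlinked?" if looks_unlinked(target) else ""
        lines.append(f"{fd:4d} -> {target}{flag}")
    return lines
```

Calling print("\n".join(report(pid))) with the PID of the stuck condor_dagman process (run as the same user or as root) shows whether its user log descriptors still point at the files you expect.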

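Anyhow, if you do a condor_hold and then a condor_release on the "stuck" condor_dagman job(s), I think that will fix things. (Releasing the job restarts DAGMan in recovery mode, so it reopens and re-reads the node job log files.) Hopefully you are running the 7.1.1 pre-release DAGMan, which has the "fast recovery" fix.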
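If there are several stuck DAGMan jobs, the hold/release pair can be scripted. The following is a rough Python sketch, not a tested procedure: the helper names are hypothetical, and the cluster ID you pass must be that of the condor_dagman job itself as shown by condor_q (not the Condor ID of a node job such as the one in the log excerpt). It defaults to a dry run that only prints the commands.

```python
import subprocess

def hold_release_commands(cluster_id):
    """Build the two commands, in order: hold first, then release."""
    job = str(cluster_id)
    return [["condor_hold", job], ["condor_release", job]]

def hold_and_release(cluster_id, dry_run=True):
    """Print the commands (dry run) or execute them, stopping on the first failure."""
    executed = []
    for cmd in hold_release_commands(cluster_id):
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if the command exits non-zero
            subprocess.run(cmd, check=True)
        executed.append(cmd)
    return executed
```

For example, hold_and_release(12345) prints the two commands for a made-up cluster ID 12345; pass dry_run=False to actually run them on the submit machine.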

Kent Wenger
Condor Team