On Fri, 18 Jul 2008, Lucia Santamaria wrote:
> for the last 4 days my dagman in morgane seems frozen and doesn't trigger
> more jobs. The dag is not yet finished, as you can see if you execute
> ...
> and also, _no_ rescue dag is created. If I look at one of the
> dag.dagman.out corresponding to one of the subdags that are not yet
> finished (for instance, cat2 dags in nsbhinj):
> ...
> 7/18 14:15:26 319886 seconds since last log event
> 7/18 14:15:26 Pending DAG nodes:
> 7/18 14:15:26   Node 20abc05cccfa0bf1b7e41fa441b90524, Condor ID 169496, status STATUS_SUBMITTED
> 7/18 14:25:26 320486 seconds since last log event
> ...

I've seen something like that once before. At that time, the cause was that the file descriptor for a node job's user log file somehow became disconnected from the actual file, without creating any errors when it was read -- it just never reported any more bytes available (but poking around in /proc/*/fd revealed some problems). That might be what's happening now. (BTW, are your user log files on a local filesystem? I vaguely remember that in the previous case the user log files may have been on a shared filesystem.)
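
For the /proc check, here is a rough Python sketch of the kind of poking around I mean. It is only an illustration, not a DAGMan tool: the function name stale_fds is made up, it assumes a Linux /proc filesystem, and it only catches the simple cases where the kernel marks an open file as deleted or its path can no longer be reached. You would pass it the PID of the stuck condor_dagman process and run it as a user allowed to read that process's fd directory.

```python
import os

def stale_fds(pid):
    """List (fd, target, reason) for open files of `pid` that look detached.

    Hypothetical helper for illustration; Linux-only (reads /proc/<pid>/fd).
    """
    fd_dir = f"/proc/{pid}/fd"
    problems = []
    for name in os.listdir(fd_dir):
        link = os.path.join(fd_dir, name)
        try:
            target = os.readlink(link)
        except OSError:
            continue  # descriptor was closed while we were scanning
        if not target.startswith("/"):
            continue  # sockets, pipes, and similar have no filesystem path
        if target.endswith(" (deleted)"):
            # The kernel appends this suffix once the file has been unlinked.
            problems.append((int(name), target, "deleted"))
        elif not os.path.exists(target):
            problems.append((int(name), target, "path not reachable"))
    return problems

if __name__ == "__main__":
    import sys
    for fd, target, reason in stale_fds(int(sys.argv[1])):
        print(f"fd {fd}: {target} [{reason}]")
```

An empty result does not prove the log descriptor is healthy (the earlier case may have looked different), so treat this as a first look only.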
Anyhow, if you do a condor_hold and then a condor_release on the "stuck" condor_dagman(s), I think that will fix things. (Hopefully you are running the 7.1.1 pre-release DAGMan, which has the "fast recovery" fix.)
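
If you have several stuck DAGMans, the hold/release step could be scripted along these lines. This is a minimal sketch under some assumptions: hold_then_release is a made-up helper name, the cluster ID you pass must be that of the condor_dagman job itself (not of a node job such as 169496), and the short pause between the two commands is my own guess rather than anything Condor requires.

```python
import subprocess
import time

def hold_then_release(dagman_cluster_id, pause_seconds=5, run=subprocess.run):
    """Hold and then release one condor_dagman job (hypothetical helper).

    `run` is injectable so the command sequence can be checked without Condor.
    """
    cluster = str(dagman_cluster_id)
    commands = [["condor_hold", cluster], ["condor_release", cluster]]
    for index, argv in enumerate(commands):
        run(argv, check=True)  # raises CalledProcessError if the command fails
        if index == 0:
            # Assumed grace period so the hold takes effect before the release.
            time.sleep(pause_seconds)
    return commands

# Example (the cluster ID 12345 is a placeholder, not taken from your queue):
# hold_then_release(12345)
```

Look up the real DAGMan cluster IDs with condor_q before running anything like this.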

Kent Wenger
Condor Team