Re: [condor-users] Dagman stalling with shadow exception messages?

Michael S. Root wrote:

The only workaround seems to be to delete the dag job from the queue and re-submit the remaining jobs (which then proceed to run fine).

Do you mean that you are having to manually submit each of the remaining jobs? DAGMan should be creating a rescue DAG when you remove it from the queue (with condor_rm). You can run the rescue DAG, and DAGMan will submit only the jobs that did not finish successfully in the first attempt.

Of course, the real problem is why the DAG is not completing in the first place, but I just want to make sure everything else is sane. If DAGMan is in some crazy state where it can't even generate the rescue DAG, then this is an important point.
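To check whether a rescue DAG was actually written after the condor_rm, something like the following rough Python sketch could help. It assumes the rescue file sits next to the DAG file and that its name starts with the DAG file name followed by ".rescue"; the exact suffix has varied between Condor versions, so treat that pattern as an assumption. The function name is made up for illustration.

```python
import glob


def find_rescue_dags(dag_path):
    """Return files that look like rescue DAGs for dag_path, sorted by name.

    Assumption: rescue files are named "<dag file>.rescue" plus an
    optional version-specific suffix, in the same directory as the DAG.
    """
    # glob.escape keeps characters such as "[" in the path from being
    # interpreted as wildcard syntax.
    pattern = glob.escape(dag_path) + ".rescue*"
    return sorted(glob.glob(pattern))
```

An empty list right after removing the DAG would suggest DAGMan could not write the rescue file at all. How a rescue file is then run depends on the Condor version, so check the DAGMan documentation for yours.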

4/6 21:00:02 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
4/6 21:00:02 Error: can't find resource with capability (<>#7698602094)
Note: That last line puzzles me. I don't know what the #7698602094 refers to.

This is perfectly normal (both the message and the puzzlement). Glancing at your two log files, it looks to me like the times don't match up, so we can't see what happened on the execution side when the shadow lost contact with the starter.

Whatever may have happened to cause the run attempt to fail, this shouldn't have caused DAGMan to get stuck, but if you are seeing a correlation, then there may be a problem.
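A quick way to see whether two log excerpts cover the same time window is to compare their first and last timestamps. The sketch below is only an illustration: it assumes every relevant line starts with a "M/D HH:MM:SS" stamp as in the excerpt above, it has to pick a year because the stamps carry none (the default of 2000 is arbitrary), and it ignores logs that wrap around a new year. All function names are made up.

```python
from datetime import datetime


def parse_stamp(line, year=2000):
    """Parse a leading 'M/D HH:MM:SS' stamp; return None if there is none."""
    parts = line.split(None, 2)
    if len(parts) < 2:
        return None
    try:
        # The year is an assumption; the log lines do not record one.
        return datetime.strptime(
            f"{year}/{parts[0]} {parts[1]}", "%Y/%m/%d %H:%M:%S"
        )
    except ValueError:
        return None


def time_span(lines, year=2000):
    """Return (earliest, latest) stamp found in lines, or None if none parse."""
    stamps = [s for s in (parse_stamp(line, year) for line in lines) if s]
    return (min(stamps), max(stamps)) if stamps else None


def spans_overlap(a, b):
    """True if two (start, end) spans share at least one instant."""
    return a is not None and b is not None and a[0] <= b[1] and b[0] <= a[1]
```

Feeding the lines of each log file to time_span and the two results to spans_overlap tells you whether the excerpts can describe the same run attempt at all.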

Is there any chance that the disk containing the job state log file(s) was ever full?
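As a first sanity check, you could look at how much space is free right now on the filesystem that holds the log. This minimal Python sketch uses only the standard library; the function name is made up, and note that it reports the current state only, so it cannot tell you whether the disk filled up at some point in the past.

```python
import os
import shutil


def free_fraction(log_path):
    """Return the fraction of free space on the filesystem holding log_path."""
    # Query the containing directory so this also works if the log file
    # itself does not exist yet.
    directory = os.path.dirname(os.path.abspath(log_path))
    usage = shutil.disk_usage(directory)
    if usage.total == 0:
        # Some special filesystems report a size of zero.
        return 0.0
    return usage.free / usage.total
```

A value close to zero for the directory holding the job log would make a past disk-full condition more plausible.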


