Re: [Condor-users] condor_rm on DAG fails to remove queued DAG nodes

On Tue, 6 Jul 2010, Ian Stokes-Rees wrote:

> I think I asked about this once before, but I don't recall (and don't
> have in my email archives) any answer: sometimes when I do a condor_rm
> on a DAG JID the DAG is removed but the queued jobs are not.
>
> If I had to guess, I'd say it is a concurrency issue caused by a
> recently submitted DAG that is actively spawning DAG node jobs.  The DAG
> is removed, all nodes of that DAG that exist at that instant are
> (possibly) removed, but any DAG nodes that are in the process of being
> instantiated/created still come into existence.
>
> Here are some log file excerpts.  You can see I held the job, then
> realized that I actually had to remove it, so I did this 3 minutes later.

Hmm, if you condor_hold the DAG, and then condor_rm it while it's on hold, this is pretty much what I'd expect to happen.

I guess we should deal with this situation better, but in the meantime, if you have a DAG that's on hold and you decide to remove it, release it first (condor_release) and then remove it (condor_rm). To be really safe, look at the dagman.out file and make sure DAGMan has finished bootstrapping before you condor_rm it.
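
The release-then-remove workaround could be scripted roughly as follows. This is only a sketch under stated assumptions: the helper names, the timeout and polling values, and above all the `marker` string are hypothetical. I'm not giving a specific dagman.out line that signals the end of bootstrapping here, so check your own dagman.out and pass in whatever text you trust. The script only looks at output appended after the release, on the assumption that the released DAGMan writes to the same dagman.out again.

```python
import subprocess
import time
from pathlib import Path


def new_output(dagman_out, offset):
    """Return the text appended to dagman.out since byte `offset`."""
    path = Path(dagman_out)
    if not path.exists():
        return ""
    with path.open("rb") as fh:
        fh.seek(offset)
        return fh.read().decode("utf-8", errors="replace")


def release_then_remove(job_id, dagman_out, marker,
                        run=subprocess.run, sleep=time.sleep,
                        timeout=300.0, poll=5.0):
    """Release a held DAG, wait for `marker` in dagman.out, then remove it.

    `marker` is an assumption supplied by the caller: a string that, in
    your dagman.out, shows that bootstrapping has finished.
    """
    path = Path(dagman_out)
    # Remember the current size so output from before the hold is ignored.
    offset = path.stat().st_size if path.exists() else 0

    run(["condor_release", str(job_id)], check=True)

    waited = 0.0
    while marker not in new_output(path, offset):
        if waited >= timeout:
            raise TimeoutError(
                f"{marker!r} not seen in {path} after {timeout} seconds"
            )
        sleep(poll)
        waited += poll

    # Only now is it (hopefully) safe to remove the DAG.
    run(["condor_rm", str(job_id)], check=True)
```

If the marker never shows up, the script raises rather than removing the DAG, which leaves you in the released state to look at dagman.out by hand.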

Kent Wenger
Condor Team