
Re: [Condor-users] properly removing/stopping a dag and all its nodes?



On Tue, 5 Jul 2011, Rowe, Thomas wrote:

I had assumed that issuing "condor_rm 267", where 267 is the cluster of a condor_dagman.exe job, would cleanly terminate all outstanding nodes of the DAG. Instead there are a bunch of jobs left according to condor_q, and I have to use -forcex to remove them. Also, condor_status indicates many "State: Claimed; Activity: Idle" slots, and I have to "condor_restart -all" to clean them up.
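For reference, the sequence I end up running looks roughly like this (267 is just the DAGMan cluster from this particular run; the leftover node job cluster ids vary):

    condor_rm 267                 # remove the DAGMan job; I expected its nodes to go too
    condor_q                      # node jobs are still in the queue
    condor_rm -forcex <cluster>   # have to force-remove each leftover node job
    condor_status                 # many slots show State: Claimed, Activity: Idle
    condor_restart -all           # only this clears the claimed-but-idle slots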

OK, setting "UWCS_CLAIM_WORKLIFE = 0" makes the cancelled nodes abandon
slots right away. But I get loads of nodes stuck in the 'X' state and
the corresponding condor_shadow processes never exit. I have to manually
kill the condor_shadow processes.
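Concretely, the change and the checks look roughly like this (condor_reconfig and the JobStatus constraint are just how I apply and inspect it, nothing DAGMan-specific):

    # local configuration change, picked up after a condor_reconfig
    UWCS_CLAIM_WORKLIFE = 0

    condor_q -constraint 'JobStatus == 3'   # node jobs stuck in the removed (X) state
    ps -ef | grep condor_shadow             # shadow processes that never exit; killed by hand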

What am I doing wrong?

What happens if you manually run condor_rm on one of the node jobs as opposed to the DAGMan job itself? (That's basically the same thing that DAGMan does.) My guess at this point is that the problems have something to do with the jobs themselves, or the configuration of your pool, rather than the fact that they're managed by DAGMan.
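Something along these lines, where 270 is just a placeholder for one of the node-job clusters you see in condor_q:

    condor_q -dag    # show the DAGMan job with its node jobs grouped beneath it
    condor_rm 270    # remove one node job directly, bypassing DAGMan
    condor_q 270     # does it leave the queue cleanly, or end up stuck in X?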

Does your dagman.out file show any errors from when DAGMan tried to remove the node jobs? (Look for the string "Error removing DAGMan jobs".)
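For example, from the DAG's submit directory (mydag.dag stands in for whatever your DAG file is called):

    grep "Error removing DAGMan jobs" mydag.dag.dagman.out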

Kent Wenger
Condor Team