
Re: [Condor-users] properly removing/stopping a dag and all its nodes?



>>> I had assumed that issuing "condor_rm 267", where 267 is the cluster
>>> of a condor_dagman.exe job, would cleanly terminate all outstanding
>>> nodes of the DAG. Instead there are a bunch of jobs left according
>>> to condor_q and I have to use -forcex to remove them. Also,
>>> condor_status indicates many "State: Claimed; Activity: Idle" slots.
>>> I have to "condor_restart -all" to clean them up.
>>
>> OK, setting "UWCS_CLAIM_WORKLIFE = 0" makes the cancelled nodes
>> abandon slots right away. But I get loads of nodes stuck in the 'X'
>> state and the corresponding condor_shadow processes never exit. I
>> have to manually kill the condor_shadow processes.
>>
>> What am I doing wrong?
>
> What happens if you manually run condor_rm on one of the node jobs as
> opposed to the DAGMan job itself?  (That's basically the same thing
> that DAGMan does.)  My guess at this point is that the problems have
> something to do with the jobs themselves, or the configuration of
> your pool, rather than the fact that they're managed by DAGMan.
> 
> Does your dagman.out file show any problems when DAGMan tried to
> remove the node jobs?  (Look for the string "Error removing DAGMan
> jobs".)

The dagman.out log didn't indicate an error.
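
(For anyone reading this later: the checks suggested above look
roughly like the following. The DAG file name and the node job id are
just placeholders from my setup.)

    # look for removal errors in the DAGMan log
    grep "Error removing DAGMan jobs" test.dag.dagman.out

    # remove a single node job directly, rather than the DAGMan cluster
    condor_rm 268.0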

The problem only appears when the nodes have very short run times, so
it turns out not actually to be a problem in my case. I was hitting it
while using a mocked-up workload for testing purposes, where each DAG
node's execution time was on the order of a few seconds. When I
switched to more realistic data inputs (30min+ runtimes), the problem
disappeared; removing the DAG job cleanly removed all executing nodes.
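
In case anyone wants to reproduce it, here is a minimal sketch of that
kind of quick-job test DAG, assuming a Unix-ish pool with /bin/sleep
available (file names and the sleep duration are just illustrative):

    # sleep.sub -- a node job that finishes within a few seconds
    universe   = vanilla
    executable = /bin/sleep
    arguments  = 5
    log        = sleep.log
    output     = sleep.$(Cluster).$(Process).out
    error      = sleep.$(Cluster).$(Process).err
    queue

    # test.dag -- a trivial two-node DAG using the submit file above
    JOB A sleep.sub
    JOB B sleep.sub
    PARENT A CHILD B

Submit it with "condor_submit_dag test.dag", note the cluster id of the
condor_dagman job in condor_q, and condor_rm that cluster while the
nodes are running.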

I can only assume there is some sort of Condor bug in the case of very
quick jobs. condor_rm must wind up getting confused across hosts about
what's actually still running. Whether it has anything to do with DAGs
I don't know.
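
In case it's useful, the leftovers I was seeing can be spotted with
something like the following (assuming a Unix submit host):

    # slots still claimed but idle after the removal
    condor_status -constraint 'State == "Claimed" && Activity == "Idle"'

    # shadow processes that never exited on the submit host
    ps ax | grep condor_shadow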