
Re: [Condor-users] condor_rm on DAG jobs a little flaky. tips?



On Mon, 22 Aug 2011, Rowe, Thomas wrote:

I have a big DAG job I run on a Windows pool, and sometimes I want to
cancel it midway through. If the cluster_id of the DAG job itself is
2054, issuing 'condor_rm 2054' seems like it's supposed to clean things
up, but I'm having problems. It fails quite often, in different ways. I
get errors like "Couldn't find/remove all jobs in cluster 2054". I get
jobs stuck in the "X state" even though this is all on a LAN and I can
see that nothing is left running. Sometimes jobs are left stuck
permanently in the "'I' state", but then condor_release on the job
fails. Also, sometimes I get ghost condor_shadow processes on the
submit machine even though condor_q is empty and there is clearly
nothing left running in the pool. I have to manually kill the
condor_shadow processes.

Is there a better way to terminate a DAG job? Some sort of constraint
argument to condor_rm with the cluster id?

Condor_rm'ing the DAGMan job itself is the preferred way to do this -- that is, in fact, supposed to cleanly remove all of the node jobs.
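For reference, a rough sketch of the usual sequence (this assumes your node jobs carry the standard DAGManJobId ClassAd attribute, and that 2054 is the DAGMan job's cluster id as in your example):

    # Remove the DAGMan job itself -- the preferred method;
    # DAGMan is then supposed to remove all of its node jobs:
    condor_rm 2054

    # If any node jobs linger afterward, sweep them up by their
    # parent DAGMan cluster id:
    condor_rm -constraint 'DAGManJobId == 2054'

    # Last resort for jobs wedged in the X (removed) state:
    condor_rm -forcex 2054

Note that -forcex bypasses the normal removal protocol, so it shouldn't be the first thing you reach for -- but none of this should be necessary if removal is working correctly, which is why the logs below would be useful.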

Can you send the relevant dagman.out file, and the SchedLog file (if it still has information for the relevant time range)?

Kent Wenger
Condor Team