
[Condor-users] condor_rm on DAG jobs a little flaky. tips?

I have a big DAG job I run on a Windows pool, and sometimes I want to cancel it midway through. If the cluster_id of the DAG job itself is 2054, issuing 'condor_rm 2054' seems like it's supposed to clean things up, but I'm having problems. It fails in different ways quite often:

- I get errors like "Couldn't find/remove all jobs in cluster 2054".
- Jobs get stuck in the 'X' state, even though this is all on a LAN and I can see that nothing is left running.
- Sometimes jobs are left permanently stuck in the 'I' state, but condor_release on those jobs fails.
- Sometimes ghost condor_shadow processes remain on the submit machine even though condor_q is empty and there is clearly nothing left running in the pool. I have to kill the condor_shadow processes manually.
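For the jobs wedged in the 'X' state specifically, I've been trying condor_rm's -forcex flag, which (as I understand it) forcibly removes jobs that are already marked removed but refuse to leave the queue. The job id below is just an example from my run:

```shell
# A plain condor_rm first; -forcex only applies to jobs already in the X state.
condor_rm 2054.3

# Then force the stuck X-state job out of the queue.
# Note: -forcex skips normal cleanup, so it won't help with leftover shadows.
condor_rm -forcex 2054.3
```

That at least clears condor_q, though it obviously doesn't explain why the jobs get stuck in the first place.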


Is there a better way to terminate a DAG job? Maybe some sort of constraint argument to condor_rm keyed on the DAG's cluster id?
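In case it helps frame the question: my understanding is that jobs submitted by DAGMan carry a DAGManJobId attribute pointing back at the DAGMan job's cluster, and condor_rm accepts a -constraint expression, so something like the following sequence seems like it ought to work (the 2054 here is just my example cluster id):

```shell
# Remove the DAGMan job itself; DAGMan is supposed to remove its node jobs.
condor_rm 2054

# Sweep up any node jobs that linger, matching on the attribute
# DAGMan sets on the jobs it submits.
condor_rm -constraint 'DAGManJobId == 2054'
```

But given the flakiness above, I'm not sure whether this is actually the recommended way to tear down a DAG, or whether there's something better.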