
[Condor-users] condor_rm on DAG jobs a little flaky. tips?

I have a big DAG job I run on a Windows pool, and sometimes I want to cancel it midway through. If the cluster_id of the DAG job itself is 2054, issuing 'condor_rm 2054' seems like it's supposed to clean things up, but I'm having problems. It fails in different ways quite often:

- I get errors like "Couldn't find/remove all jobs in cluster 2054".
- Jobs get stuck in the 'X' state, even though this is all on a LAN and I can see that nothing is left running.
- Sometimes jobs are left permanently stuck in the 'I' state, but condor_release on those jobs fails.
- Sometimes ghost condor_shadow processes remain on the submit machine even though condor_q is empty and there is clearly nothing left running in the pool. I have to kill the condor_shadow processes manually.
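For the jobs wedged in the 'X' state specifically, I've been trying condor_rm's -forcex flag, which (as I understand it) forcibly removes jobs that are already marked removed but refuse to leave the queue. The job id below is just an example from my run:

```shell
# A plain condor_rm first; -forcex only applies to jobs already in the X state.
condor_rm 2054.3

# Then force the stuck X-state job out of the queue.
# Note: -forcex skips normal cleanup, so it won't help with leftover shadows.
condor_rm -forcex 2054.3
```

That at least clears condor_q, though it obviously doesn't explain why the jobs get stuck in the first place.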


Is there a better way to terminate a DAG job? Maybe some sort of constraint argument to condor_rm keyed on the DAG's cluster id?
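In case it helps frame the question: my understanding is that jobs submitted by DAGMan carry a DAGManJobId attribute pointing back at the DAGMan job's cluster, and condor_rm accepts a -constraint expression, so something like the following sequence seems like it ought to work (the 2054 here is just my example cluster id):

```shell
# Remove the DAGMan job itself; DAGMan is supposed to remove its node jobs.
condor_rm 2054

# Sweep up any node jobs that linger, matching on the attribute
# DAGMan sets on the jobs it submits.
condor_rm -constraint 'DAGManJobId == 2054'
```

But given the flakiness above, I'm not sure whether this is actually the recommended way to tear down a DAG, or whether there's something better.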