
[Condor-users] condor_rm on DAG job



I just watched a DAG take 63 minutes to remove its last node after a condor_rm command.  The DAG nodes run on remote resources via the new glideinWMS system from FNAL/UCSD.

Is there some way to have condor_rm finish more quickly?  The full DAG had 100k nodes, but we have a configuration setting that keeps no more than 1000 nodes of a DAG idle at once, so there were about 1000 queued (idle) jobs in our local job pool and maybe 100 running when the condor_rm command was issued.  One hour seems like a long time to remove 100 jobs.  You can see the logs here if you like:

http://glidein.nebiogrid.org/~ijstokes/phaser/clean/3cqg/config_old/

50 jobs from the overall DAG finished after the removal -- the last of them around an hour after the condor_rm command was issued.
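
(For reference, the idle throttle mentioned above is the standard DAGMan idle-job limit; a minimal sketch of the two usual ways to set it, knob names from memory, so double-check against your version's manual:)

    # per submission:
    $ condor_submit_dag -maxidle 1000 job.dag

    # or globally, in the condor configuration:
    DAGMAN_MAX_JOBS_IDLE = 1000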

The problem we see when this happens is the following:

3pm: condor_submit_dag job.dag
4pm: discover mistake, execute condor_rm dag.jobid
4:10pm: fix script or classad or DAG, resubmit DAG
4:11pm: oops, a rescue DAG and old log files exist.  Delete these, resubmit (see the cleanup sketch below)

A - 4:12pm: hey, the log files are still there!  The DAG nodes are still running and writing to them (thus re-creating them).

B - 5:00pm: discover that the old jobs are still running and have now mixed their output with the output of the new jobs and DAG.

Scenario B is what I've just witnessed, but I'd swear I've seen scenario A before as well.
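
For what it's worth, the manual cleanup between the rm and the resubmit currently looks roughly like this (cluster id and file names illustrative):

    $ condor_rm 1234.0                             # remove the DAGMan job itself
    $ condor_q -constraint 'DAGManJobId == 1234'   # poll until this comes back empty
    $ rm -f job.dag.rescue* job.dag.dagman.out *.log
    $ condor_submit_dag job.dag

The middle step is the painful one: until condor_q actually comes back empty, deleting the logs just invites scenario A above.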

What advice do people have for completing a condor_rm in <15 minutes?
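
The only thing I've spotted in the manual that looks relevant is condor_rm's -forcex option, which (as I read it) forces the immediate local removal of jobs already stuck in the 'X' (removed) state.  A sketch, in case that is the right direction:

    $ condor_rm 1234.0                          # normal removal of the DAGMan job
    $ condor_q -constraint 'JobStatus == 3'     # list jobs stuck in the removed (X) state
    $ condor_rm -forcex -all                    # force X-state jobs out of the local queue

My worry is that this only cleans up the local queue, so the payloads on the remote glideins might keep running (and writing to the logs) anyway.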

Ian
-- 
Ian Stokes-Rees, PhD                       W: http://hkl.hms.harvard.edu
ijstokes@xxxxxxxxxxxxxxxxxxx               T: +1 617 432-5608 x75
NEBioGrid, Harvard Medical School          C: +1 617 331-5993