
[Condor-users] condor_rm on DAG job



I just watched a DAG take 63 minutes to remove its last node after a condor_rm command.  The DAG nodes run on remote resources via the new glideinWMS system from FNAL/UCSD.

Is there some way to have condor_rm finish more quickly?  The full DAG had 100k nodes, but we have a configuration setting that keeps no more than 1000 nodes of a DAG idle at once, so there were about 1000 queued (idle) jobs in our local job pool and maybe 100 running when the condor_rm command was issued.  One hour seems like a long time to remove 100 jobs.  You can see the logs here if you like:

http://glidein.nebiogrid.org/~ijstokes/phaser/clean/3cqg/config_old/

50 jobs from the overall DAG finished after the removal -- the last of them around an hour after the condor_rm command was issued.
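
(For reference, the idle throttle mentioned above is the standard DAGMan idle-job limit; a minimal sketch of the two usual ways to set it, knob names from memory, so double-check against your version's manual:)

    # per submission:
    $ condor_submit_dag -maxidle 1000 job.dag

    # or globally, in the condor configuration:
    DAGMAN_MAX_JOBS_IDLE = 1000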

The problem we see when this happens is the following:

3pm: condor_submit_dag job.dag
4pm: discover mistake, execute condor_rm dag.jobid
4:10pm: fix script or classad or DAG, resubmit DAG
4:11pm: oops, a rescue DAG and old log files exist.  Delete these, resubmit (see the cleanup sketch below)

A - 4:12pm: hey, the log files are still there!  The DAG nodes are still running and writing to them (thus re-creating them).

B - 5:00pm: discover that the old jobs are still running and have now mixed their output with the output of the new jobs and DAG.

Scenario B is what I've just witnessed, but I'd swear I've seen scenario A before as well.
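
For what it's worth, the manual cleanup between the rm and the resubmit currently looks roughly like this (cluster id and file names illustrative):

    $ condor_rm 1234.0                             # remove the DAGMan job itself
    $ condor_q -constraint 'DAGManJobId == 1234'   # poll until this comes back empty
    $ rm -f job.dag.rescue* job.dag.dagman.out *.log
    $ condor_submit_dag job.dag

The middle step is the painful one: until condor_q actually comes back empty, deleting the logs just invites scenario A above.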

What advice do people have for completing a condor_rm in <15 minutes?
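
The only thing I've spotted in the manual that looks relevant is condor_rm's -forcex option, which (as I read it) forces the immediate local removal of jobs already stuck in the 'X' (removed) state.  A sketch, in case that is the right direction:

    $ condor_rm 1234.0                          # normal removal of the DAGMan job
    $ condor_q -constraint 'JobStatus == 3'     # list jobs stuck in the removed (X) state
    $ condor_rm -forcex -all                    # force X-state jobs out of the local queue

My worry is that this only cleans up the local queue, so the payloads on the remote glideins might keep running (and writing to the logs) anyway.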

Ian
-- 
Ian Stokes-Rees, PhD                       W: http://hkl.hms.harvard.edu
ijstokes@xxxxxxxxxxxxxxxxxxx               T: +1 617 432-5608 x75
NEBioGrid, Harvard Medical School          C: +1 617 331-5993