
Re: [Condor-users] condor_rm on DAG job



On Thu, 24 Jun 2010, Ian Stokes-Rees wrote:

I just watched a DAG take 63 minutes to remove the last DAG node after a
condor_rm command.  The DAG nodes are running on remote resources, but
using the new glideinWMS system from FNAL/UCSD.

Is there some way to have condor_rm finish more quickly?

You should be aware that if it *does* finish more quickly than that,
you will probably break something--whether it is your local schedd or
the remote site depends on the details.  Recent developments have made
condor_rm go slower, not faster.  In particular, when you are running
in glideinWMS with glexec, every condor_rm call makes two glexec calls
on the backend, one to stop the job and one to remove the directory in
which it was running; a mass removal can therefore amount to a denial
of service attack on the site if it is not done carefully.

Are you using Condor 7.4.1 or greater?  If so, there are
job_stop_count and job_stop_delay settings that can be configured.
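
A minimal condor_config sketch of that throttle (the macro spellings
and values here are from memory, so double-check them against the 7.4
manual before relying on them):

    # stop at most 5 jobs per interval when a big condor_rm hits the queue
    JOB_STOP_COUNT = 5
    # wait 10 seconds between each batch of stops
    JOB_STOP_DELAY = 10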

The full DAG
had 100k nodes, but we have a configuration setting that limits a DAG
to no more than 1000 idle nodes, so there were about 1000 queued
(idle) jobs in our local job pool, and maybe 100 running when the
condor_rm command was issued.  1 hour seems like a long time to remove
100 jobs.  You can see the logs if you like here:

http://glidein.nebiogrid.org/~ijstokes/phaser/clean/3cqg/config_old/
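
(For reference, that idle-node throttle is set roughly like this; the
exact knob and option names are from memory, so treat this as a
sketch:)

    # condor_config (or the DAGMan config file): cap idle node jobs per DAG
    DAGMAN_MAX_JOBS_IDLE = 1000

    # or per submission
    condor_submit_dag -maxidle 1000 job.dag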

50 jobs from the overall DAG finished -- the last of them around 1 hour
after the condor_rm command was issued.

The problem we see when this happens is the following:

3pm: condor_submit_dag job.dag
4pm: discover mistake, execute condor_rm dag.jobid

4:09pm--next time, do a condor_q to make sure everything is really
gone, and maybe a ps -ef as well to check.


4:10pm: fix script or classad or DAG, resubmit DAG
4:11pm: oops, the rescue DAG still exists, and so do the log files.
Delete these, resubmit (see the cleanup sketch after this timeline)

A - 4:12pm: hey, the log files are still there!  The DAG nodes are still
running and writing to them (thus re-creating them).

B - 5:00pm: discover that the old jobs are still running and have now
mixed their output with that of the new jobs and DAG.
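
(For the 4:11pm cleanup step, roughly the following -- the file names
follow the usual DAGMan naming for a job.dag submission, so treat this
as a sketch, and only do it once condor_q really shows nothing left:)

    # make sure nothing from the old DAG is still queued or running
    condor_q
    ps -ef | grep condor_dagman

    # clear the rescue DAG and DAGMan's own log/output files, then resubmit
    rm -f job.dag.rescue* job.dag.dagman.out job.dag.dagman.log \
          job.dag.lib.out job.dag.lib.err
    condor_submit_dag -f job.dag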

Scenario B is what I've just witnessed, but I'd swear I've also seen
scenario A before as well.

What advice do people have for completing a condor_rm in <15 minutes?

Test on 1% of the DAG size so you don't have to condor_rm 100K-job
DAGs so often.  This mode of operation is creating havoc on sites all
across the OSG and has left some gatekeepers in a dazed and confused
state for days.

Or--do like CDF does and use one schedd to handle dagman and
another one (or more than one) to handle the actual jobs.
I've never seen any condor_schedd handle 10k+ jobs simultaneously,
of whatever universe, without grief and heartache, whether running
or waiting.
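
A rough sketch of that split-schedd setup, loosely following the usual
multiple-schedd recipe -- the local name, log, and spool paths here are
made up for illustration, so check the Condor manual before copying it:

    # condor_config on the submit host: a second schedd just for node jobs
    SCHEDD_JOBS      = $(SCHEDD)
    SCHEDD_JOBS_ARGS = -local-name schedd_jobs
    SCHEDD.SCHEDD_JOBS.SCHEDD_NAME = schedd_jobs@
    SCHEDD.SCHEDD_JOBS.SCHEDD_LOG  = $(LOG)/SchedLog.schedd_jobs
    SCHEDD.SCHEDD_JOBS.SPOOL       = $(SPOOL)/schedd_jobs
    DAEMON_LIST = $(DAEMON_LIST) SCHEDD_JOBS

    # then point the node job submissions at it, e.g.
    condor_submit -name schedd_jobs@<submit host> node.sub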

Steve Timm


Ian



--
------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525
timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Assistant Group Leader.