
[Condor-users] slow scheduling of dagman jobs



Hi everyone,
I'm running into a performance issue when submitting DAGMan jobs.  When I submit a DAG of, say, 100 nodes, it takes quite a while for all 100 nodes to show up in the queue.  After an initial wait of about 12 seconds, the nodes are added to the queue at a rate of about 7 per second.  The nodes have no dependencies on each other; they are completely standalone and could be submitted without DAGMan.  When I do submit the jobs without DAGMan, they are added to the queue much faster, at about 100 per second.  I get that submission rate whether I submit one job with a "queue 100" or 100 separate jobs in one submit file.
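
For anyone who wants to reproduce this, here is a minimal sketch of how a flat DAG like the one described could be generated. The node names and the submit file name "node.sub" are made up for illustration; the only thing taken from the setup above is that there are 100 nodes with no PARENT/CHILD lines between them.

```python
def build_dag(num_nodes, submit_file="node.sub"):
    """Return the text of a DAG file with num_nodes independent nodes.

    Each node is declared with a JOB line; no PARENT/CHILD lines are
    emitted, so nothing constrains the order in which nodes are submitted.
    The node names and submit_file are hypothetical placeholders.
    """
    lines = [f"JOB node{i} {submit_file}" for i in range(num_nodes)]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the DAG to stdout; redirect it to a file to use it.
    print(build_dag(100), end="")
```

Redirecting the output to a file (for example flat.dag) and passing that file to condor_submit_dag should give a DAG of the same shape as the one I'm testing with.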

Here are the relevant config settings:
DAGMAN_MAX_SUBMITS_PER_INTERVAL = 250
DAGMAN_SUBMIT_DELAY = 0
DAGMAN_USER_LOG_SCAN_INTERVAL = 5
(at least I think that's all of them)

I have two Condor environments set up, both running Condor 7.4.4: one has a shared central manager and scheduler (just for testing), and one has a dedicated central manager and multiple schedulers (for production use).  This behavior is present in both environments.  The testing environment averages about 7 nodes per second; the production environment averages closer to 5 nodes per second.  While the nodes are being added to the queue, there is no noticeable increase in load, memory usage, or IO wait on either the scheduler or the central manager (this is true for both environments).  Also, the jobs have notification set to never, and I have set some of the various *_DEBUG values to the least taxing settings (I think?).
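
To put the numbers side by side, here is a rough back-of-the-envelope sketch of how long 100 nodes take to reach the queue at the rates I'm seeing. The function name is just for illustration, and I'm assuming the roughly 12-second initial wait applies in the production environment as well, which I have not measured separately.

```python
def time_to_queue(num_nodes, rate_per_sec, startup_delay_sec=0.0):
    """Estimate seconds until all nodes are queued.

    Simple linear model: a fixed startup delay followed by a constant
    submission rate. This is an approximation, not a measurement.
    """
    return startup_delay_sec + num_nodes / rate_per_sec


if __name__ == "__main__":
    # DAGMan, testing environment: ~12 s wait, then ~7 nodes/s
    print(f"dag (test): {time_to_queue(100, 7, 12):.1f} s")
    # DAGMan, production environment: ~5 nodes/s (12 s wait assumed)
    print(f"dag (prod): {time_to_queue(100, 5, 12):.1f} s")
    # Plain submit without DAGMan: ~100 jobs/s, no noticeable wait
    print(f"direct:     {time_to_queue(100, 100):.1f} s")
```

Under those assumptions the DAG takes somewhere around 26 to 32 seconds to get everything queued, against about 1 second for a plain submit.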

condor_config_val -dump | grep -i debug 
ALL_DEBUG =
COLLECTOR_DEBUG =
CREDD_DEBUG = D_FULLDEBUG
GRIDMANAGER_DEBUG =
HAD_DEBUG =
HDFS_DEBUG =
JOB_ROUTER_DEBUG =
KBDD_DEBUG =
LEASEMANAGER.DEBUG_ADS = False
LEASEMANAGER_DEBUG = D_FULLDEBUG
MASTER_DEBUG =
NEGOTIATOR_DEBUG = D_MATCH
REPLICATION_DEBUG =
ROOSTER_DEBUG =
SCHEDD_DEBUG =
SHADOW_DEBUG =
STARTD_DEBUG =
STARTER_DEBUG = D_NODATE
STORK_DEBUG = D_FULLDEBUG
TRANSFERER_DEBUG =


Has anyone else seen performance like this, or does anyone know how to figure out what is taking so long to dispatch these nodes to the queue?

Thanks,
Patty