We currently need to use HTCondor to run a large number (on the order of 10k) of short jobs, each taking approximately 10 seconds. I realise that HTCondor is not really designed for this, but these jobs are an adaptation of older jobs (which took on the order of minutes) against a new, more finely split dataset, so we still need the resource management that HTCondor provides.
I've had some fairly large issues getting tests of this to run in reasonable times, so I was wondering whether there are any settings or configuration options I should be looking at to improve this.
Current HTCondor version: 8.5.1; all systems on Ubuntu 14.04.
All jobs use the vanilla universe. We have a single central manager, which runs the SCHEDD, COLLECTOR and NEGOTIATOR, plus 5 STARTD nodes.
Steps to reproduce:
Set up a DAG containing 10,000 jobs, each declared as "JOB x test.sub", where test.sub contains:
executable = /bin/sleep
arguments = 1
universe = vanilla
transfer_executable = false
requirements = TARGET.Machine == "<machine with 48 slots>"
Submit that DAG.
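For concreteness, the DAG file can be generated with a few lines of script (the file names test.dag/test.sub and the jobN node names here are just for illustration):

```python
# Generate a DAG of 10,000 independent jobs, all using the same submit file.
# "test.dag", "test.sub" and the "jobN" node names are illustrative.
with open("test.dag", "w") as dag:
    for i in range(10_000):
        dag.write(f"JOB job{i} test.sub\n")
# Then submit it with: condor_submit_dag test.dag
```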
The real processing time for these jobs should be roughly 10,000 s / 48 slots, which is under 3.5 minutes.
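The ideal-case arithmetic, for anyone checking (1 s of sleep per job, 48 slots, assuming perfect packing with no scheduling overhead):

```python
# Back-of-the-envelope lower bound on wall-clock time for the DAG.
jobs = 10_000
seconds_per_job = 1      # each job is `sleep 1`
slots = 48
ideal = jobs * seconds_per_job / slots  # total work / parallelism
print(f"{ideal:.0f} s, i.e. about {ideal / 60:.1f} minutes")
```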
However, this DAG takes approximately 30 minutes to complete, meaning that the overhead in this (admittedly extreme) example is around 900% of the real processing time.
We currently have DAGMAN_MAX_SUBMITS_PER_INTERVAL set to 200, but this doesn't seem to be the bottleneck: the jobs are in the schedd queue, they just aren't taking the expected 1 s to run. Instead we are seeing run times of up to 9 seconds.
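For reference, these are the scheduling-related condor_config knobs I'm aware of that could plausibly matter here. Only the first line reflects our actual configuration; the other values are illustrative placeholders, not settings we have tuned:

```
# Knobs that plausibly affect short-job throughput.
# Only the first reflects our actual config; other values are illustrative.
DAGMAN_MAX_SUBMITS_PER_INTERVAL = 200
NEGOTIATOR_INTERVAL = 20      # how often the negotiator starts a matchmaking cycle
CLAIM_WORKLIFE = 3600         # how long a claim may be reused for further jobs
```

If any of these (or others) are the right place to look, pointers would be welcome.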
We see the same issue by changing the above submit file to "queue 10000" and submitting that directly, without DAGMan.
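That is, the same submit description with only the queue statement changed:

```
executable = /bin/sleep
arguments = 1
universe = vanilla
transfer_executable = false
requirements = TARGET.Machine == "<machine with 48 slots>"
queue 10000
```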
Please could someone explain what is going on here that takes so long? I would certainly expect some overhead, but this seems very high to me. Any suggestions on what to try to reduce it would be greatly appreciated!