Re: [Condor-users] slow scheduling of dagman jobs




The limiting factor in speed of condor_submit completion is usually the time it takes to complete fsync() in the schedd.  If your jobs have user logs, prior to 7.5, condor_submit also called fsync().

To confirm whether this is your problem, you can start a long series of condor_submit invocations.  While those are running, periodically run gstack <schedd_pid>.  This will print out the schedd stack.  Is fsync() frequently listed?
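The sampling described above could be automated roughly as follows. This is only a sketch: the helper names (stack_mentions, fraction_with_symbol, sample_stacks) are made up for illustration, and it assumes gstack is on the PATH and that you have permission to attach to the schedd process.

```python
import subprocess
import time


def stack_mentions(stack_text, symbol="fsync"):
    """Return True if any line of one stack dump mentions the symbol."""
    return any(symbol in line for line in stack_text.splitlines())


def fraction_with_symbol(samples, symbol="fsync"):
    """Fraction of stack dumps that mention the symbol (0.0 if no samples)."""
    if not samples:
        return 0.0
    hits = sum(1 for s in samples if stack_mentions(s, symbol))
    return hits / len(samples)


def sample_stacks(pid, count=20, interval=0.5):
    """Run `gstack <pid>` repeatedly and collect the text of each dump.

    Assumes gstack is installed; pid is the schedd's process id.
    """
    samples = []
    for _ in range(count):
        result = subprocess.run(
            ["gstack", str(pid)], capture_output=True, text=True
        )
        samples.append(result.stdout)
        time.sleep(interval)
    return samples
```

While the submits are running, something like fraction_with_symbol(sample_stacks(schedd_pid)) gives a rough idea of how often the schedd is sitting in fsync(). A value close to 1.0 would point at the disk rather than the CPU.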

Things you can do to speed up fsync: put $(SPOOL) on a fast disk that doesn't have a lot of other usage.  For testing, you can even stick it in /dev/shm, which is basically a ramdisk.  The same applies to user logs.
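To get a feel for how much the disk matters, a rough timing of write+fsync in a given directory can be compared between the spool disk and /dev/shm. This is a minimal sketch, not how the schedd itself writes its log; time_fsync is a hypothetical helper, and the write count and payload size are arbitrary choices.

```python
import os
import tempfile
import time


def time_fsync(directory, writes=50, payload=b"x" * 512):
    """Average seconds per write+fsync for a temp file created in `directory`."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        for _ in range(writes):
            os.write(fd, payload)
            os.fsync(fd)  # force the data to stable storage before continuing
        elapsed = time.perf_counter() - start
    finally:
        # Clean up the scratch file whether or not the timing succeeded.
        os.close(fd)
        os.remove(path)
    return elapsed / writes
```

For example, compare time_fsync(spool_dir) with time_fsync("/dev/shm"), where spool_dir is whatever condor_config_val SPOOL reports. A large gap between the two suggests that moving $(SPOOL) to faster storage would help.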

--Dan

On 9/8/11 3:32 PM, Patty Bragger wrote:
Thanks David,
I've tried adding the -disable flag, and it seems to help a little bit, but not a whole lot. It's now averaging about 10 seconds per 100 jobs instead of 11 seconds.

So this is still a pretty stark difference in performance from what you're seeing, and granted, my 4-core machine is probably pretty weak compared to a 16-core Nehalem, but I guess I was still expecting to see some kind of explanation by way of maxed-out CPU or something, and I'm not seeing that at all. I submitted 1200 jobs, just to sustain the "load" for a noticeable time of 2+ minutes. During that time, the load average didn't even break 1, and the CPU usage increased from about 10% to about 35%.

Oh well, this isn't the end of the world, thanks for all of the info.

-Patty

On Thu, Sep 8, 2011 at 3:37 PM, David J. Herzfeld <herzfeldd@xxxxxxxxx> wrote:
On Thu, 2011-09-08 at 15:23 -0400, David J. Herzfeld wrote:
> Hi Patty:
>
> On Thu, 2011-09-08 at 14:40 -0400, Patty Bragger wrote:
> > So an average of about 9 jobs/sec, which is faster (but only a little) than
> > submitting through dag.  What kind of rates are you guys getting?  Maybe
> > this is normal?
> >
>
> My guess is that the numbers you are seeing are probably pretty normal
> (both for dagman and when calling directly from the command line).
>
> We see faster times (real = 0m2.454s, user = 0m1.315s, ~40 jobs/s), but
> have a pretty customized config. For instance, we set
> SUBMIT_SKIP_FILECHECKS = False
> SUBMIT_SEND_RESCHEDULE = False
> I would assume that both of these knobs would reduce submit times
> (although I haven't tested them myself).

Sorry, that should be:
SUBMIT_SKIP_FILECHECKS = True
(see http://www.cs.wisc.edu/condor/manual/v7.6/3_3Configuration.html#SECTION004314000000000000000).
Sorry about that.

You should be able to emulate this behavior with the -disable flag to
condor_submit (if you want to try to see if that increases your speed).

Best of luck,
DJH

_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/


