
Re: [HTCondor-users] slow submission rate



Joe:

Thanks for the suggestion. I actually keep full debugging on most of the time. I found that the DAGManNodesLog file (typically ends in .nodes.log) was being written to several times a second on average, though the writes came in bursts, and that this file was on NFS.

These values are all true, which I think means that HTCondor takes care of file locking on its own, separately from the NFS locking mechanism. Nevertheless, latency in file I/O could still contribute to slow condor_q response times.

[pcdev2:~] condor_config_val CREATE_LOCKS_ON_LOCAL_DISK ENABLE_USERLOG_LOCKING DAGMAN_LOG_ON_NFS_IS_ERROR
true
true
true

Here's an example of two quick writes to the nodes.log file, with some evidence of what I think is local file locking in /tmp:

09/18/13 21:55:26 (pid:18131) Writing record to user logfile=/people/scaudill/logs/tmpIcIkFp owner=scaudill
09/18/13 21:55:26 (pid:18131) Writing record to user logfile=/home/scaudill/Projects/nsbh/aLIGO/MDCs/MDC1/gaussian_ihope_runs/full_run/966384015-971568015/nsbhtaylort2_two/inspiral_hipe_nsbhtaylort2_two.NSBHTAYLORT2_TWO.dag.nodes.log owner=scaudill
09/18/13 21:55:26 (pid:18131) dirscat: dirpath = /tmp
09/18/13 21:55:26 (pid:18131) dirscat: subdir = condorLocks
09/18/13 21:55:26 (pid:18131) FileLock object is updating timestamp on: /tmp/condorLocks/14/35/787671174006226.lockc
09/18/13 21:55:26 (pid:18131) WriteUserLog::initialize: opened /people/scaudill/logs/tmpIcIkFp successfully
09/18/13 21:55:26 (pid:18131) dirscat: dirpath = /tmp
09/18/13 21:55:26 (pid:18131) dirscat: subdir = condorLocks
09/18/13 21:55:26 (pid:18131) FileLock object is updating timestamp on: /tmp/condorLocks/36/35/594317834869511.lockc
09/18/13 21:55:26 (pid:18131) WriteUserLog::initialize: opened /home/scaudill/Projects/nsbh/aLIGO/MDCs/MDC1/gaussian_ihope_runs/full_run/966384015-971568015/nsbhtaylort2_two/inspiral_hipe_nsbhtaylort2_two.NSBHTAYLORT2_TWO.dag.nodes.log successfully
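
A quick sanity check that the lock directory really is on a local
filesystem rather than NFS:

    df -hT /tmp/condorLocks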

I've directed the user to set dagman_log in the condor_submit file, since I think we're simply hitting the limits of making lots of small writes to an NFS filesystem. The DAG can easily create 2000 jobs in the span of tens of minutes if not throttled by the admin or the user.

I don't see evidence of this attribute in Condor 7.8, but the check-in history of ticket 2807 suggests it may have been introduced in Condor 7.9.
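
As a sketch of the end result I'm after (the paths are just examples of
a local, non-NFS disk; the executable is a placeholder), each node's
submit description would keep its user log on local disk:

    # hypothetical node submit description with the user log kept off NFS
    universe   = vanilla
    executable = /path/to/node_executable
    log        = /usr1/scaudill/nsbhtaylort2_two.nodes.log
    queue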

All that said, in addition to moving this log file off NFS, I think I'll try the SSD approach as well, since the response time of condor_q can easily reach a few seconds once you're in several-thousand-job territory, even under good circumstances.
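
Concretely, that would be Todd's JOB_QUEUE_LOG trick quoted further down
in this thread, along the lines of (the /ssd mount point being wherever
the SSD ends up mounted):

    JOB_QUEUE_LOG = /ssd/condor_spool/job_queue.log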

Tom

--
Tom Downes
Associate Scientist and Data Center Manager
Center for Gravitation, Cosmology and Astrophysics
University of Wisconsin-Milwaukee
414.229.2678


On Wed, Sep 18, 2013 at 11:19 AM, Joe Boyd <boyd@xxxxxxxx> wrote:
Answering a question you didn't ask...

Another approach is to turn on full debugging for your schedd, tail -f the log file, and try to figure out what the schedd is doing when the 10-second responses occur.  I haven't looked in a few condor versions, but there were single-threaded sections in the schedd, so your problem may not be disk speed.  In my particular case, we had enough jobs, enough submitters, and a complex enough negotiator config that while the schedd and negotiator were talking during a negotiation cycle, condor_q would hang until that connection was closed.  The condor_q output would appear immediately once the connection to the negotiator closed.
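
Something along these lines, as a rough sketch (SCHEDD_DEBUG and
SCHEDD_LOG are the standard knobs; the local config file path is just an
example):

    # turn on full schedd debugging, pick up the change, and watch the
    # log while reproducing a slow condor_q
    echo 'SCHEDD_DEBUG = D_FULLDEBUG' >> /etc/condor/condor_config.local
    condor_reconfig
    tail -f "$(condor_config_val SCHEDD_LOG)"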

If you've already made sure your problem is disk speed, then ignore this; but if it were disk speed, I'd expect more consistent condor_q response times.

joe


On 09/18/2013 10:37 AM, Tom Downes wrote:
Should I expect moving the $(SPOOL) directory (or just the queue log) to
an SSD to increase the responsiveness of condor_q, in addition to
potentially increasing job submission rates?

I am in the regime Todd mentioned: dozens of users with 2000-3000 jobs
per submit host. I want to upgrade the disks anyway and will probably
switch to SSDs for the Condor partitions, but I'd like to set my
expectations (and cost justifications) appropriately.

I tried turning off fsync and running condor_reconfig and did not see an
obvious change. I don't intend to run in this state (or on tmpfs)
anyhow. The condor_q latency, as measured by "time condor_q", can be
anywhere from 0.5 seconds to 10 seconds, without any obvious correlation
to the number of jobs in the queue or how many are idle/running. Those
on the LIGO list may recall I've had some NFS-related slowness on our
cluster lately, so there may be other reasons for the latency beyond
using older disks.
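
For what it's worth, I'm measuring the latency roughly like this (a
sketch; the repetition count is arbitrary):

    # run condor_q repeatedly and look at the spread of wall-clock times
    for i in $(seq 20); do ( time condor_q > /dev/null ) 2>&1 | grep real; done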

FYI: the man page for condor_reconfig does not successfully convert the
section and page numbers of the PDF into text in the part that tells you
which variables do not get properly reset by condor_reconfig.


--
Tom Downes
Associate Scientist and Data Center Manager
Center for Gravitation, Cosmology and Astrophysics
University of Wisconsin-Milwaukee
414.229.2678


On Wed, Aug 7, 2013 at 6:11 AM, Pek Daniel <pekdaniel@xxxxxxxxx> wrote:

    Hi!

    So finally I eliminated the possibility of I/O bottlenecks by putting
    /var/lib/condor and /var/log/condor onto tmpfs. I also set
    transfer_executable = False and should_transfer_files = NO (to rule
    out a network bottleneck).
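
    Roughly what I did, as a sketch (sizes and paths are examples, and
    condor was stopped first, with the spool/log directories recreated
    and chowned to the condor user after mounting):

        # put the schedd's state and logs on tmpfs
        mount -t tmpfs -o size=2G tmpfs /var/lib/condor
        mount -t tmpfs -o size=512M tmpfs /var/log/condor

        # and in the submit description, to avoid any file transfer:
        transfer_executable   = False
        should_transfer_files = NO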

    Now I suspect the bottleneck is the number of context switches:

    procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
     r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
     3  0     10   1325    102    217    0    0     1    25    9   11  3  2 93  2  0
     0  0     10   1324    102    217    0    0     0     8 1945 3087 10  8 79  3  0
     0  0     10   1323    102    217    0    0     0     0 1964 3105 13  8 80  0  0
     0  0     10   1322    102    217    0    0     0     0 2267 3608 12  9 79  0  0
     0  0     10   1321    102    217    0    0     0    34 1502 2395  8  6 86  0  0
     0  0     10   1320    102    217    0    0     0     0 1969 3088 13  8 79  0  1
     0  0     10   1319    102    217    0    0     0    84 2291 3654 12  9 76  4  0
     1  0     10   1318    102    217    0    0     0     0 2083 3089 23 10 67  0  0
     0  0     10   1317    102    218    0    0     0     0 2070 3303 10  9 81  0  1
     0  0     10   1316    102    218    0    0     0    54 1257 1994  6  5 88  2  0
     0  0     10   1315    102    218    0    0     0     0 1975 3146 12  8 80  0  0
     0  0     10   1314    102    218    0    0     0     0 2375 3810 12 10 79  0  0
     0  0     10   1313    102    218    0    0     0     0 2017 3158 13  8 78  0  1

    3800/sec seems a little too high. Any idea how I can tune Condor or
    Linux to bring this down?
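
    In case it's useful, this is how I'm attributing the switches to
    particular processes (pidstat comes from the sysstat package; the
    5-second interval is arbitrary):

        # per-process context-switch rates (cswch/s, nvcswch/s) for condor daemons
        pidstat -w -C condor 5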

    Thanks,
    Daniel

    2013/8/2 Todd Tannenbaum <tannenba@xxxxxxxxxxx>:

     > On 8/2/2013 9:09 AM, Dan Bradley wrote:
     >>
     >> Be aware that turning off fsync in the condor_schedd can lead to
     >> loss of job state in the event of power loss or other sudden death
     >> of the schedd.  This could result in jobs that were submitted
     >> shortly before the outage disappearing from the queue without being
     >> run.  It could also result in jobs being run twice.
     >>
     >> If that is acceptable for your purposes, then your problem is
     >> solved. If it is not acceptable, then focus on improving the
     >> performance of the filesystem containing $(SPOOL).
     >>
     >
     > FWIW, on our busy submit nodes (dozens of users with typically
     > thousands of running jobs), we put $(SPOOL) on a solid-state drive
     > (SSD).  Specifically, we mount the SSD on /ssd and then put in
     > condor_config:
     >    JOB_QUEUE_LOG = /ssd/condor_spool/job_queue.log
     > The above allows us to put the job_queue.log onto the SSD - this is
     > the schedd's job queue and the file that gets a lot of fsyncs on
     > transaction boundaries.  By using JOB_QUEUE_LOG, we can use a
     > small/cheap SSD that does not have to be large enough to hold the
     > entire contents of the $(SPOOL) directory.
     >
     > Performance is greatly improved and the risks Dan outlines above are
     > avoided.
     >
     > -Todd
     >
     >
     >> --Dan
     >>
     >> On 8/2/13 8:14 AM, Pek Daniel wrote:
     >>>
     >>> Thanks, the FSYNC trick solved the issue! :)
     >>>
     >>>
     >>> 2013/8/1 Dan Bradley <dan@xxxxxxxxxxxx>:
     >>>
     >>>
     >>>
     >>>     Are you timing just condor_submit, or are you also timing job
     >>>     run/completion rates?
     >>>
     >>>     Job submissions cause the schedd to commit a transaction to
     >>>     $(SPOOL)/job_queue.log.  If the disk containing that is slow,
     >>>     submissions will be slow.  One way to verify if this is the
     >>>     limiting factor is to add the following to your configuration:
     >>>
     >>>     CONDOR_FSYNC = FALSE
     >>>
     >>>     Another thing to keep in mind is that if you can batch
     >>>     submissions of many jobs into a single submit file, there will
     >>>     be fewer transactions.
     >>>
     >>>     --Dan
     >>>
     >>>
     >>>     On 8/1/13 10:17 AM, Pek Daniel wrote:
     >>>
     >>>         Hi!
     >>>
     >>>         I'm experimenting with condor: I'm trying to submit a lot
     >>>         of dummy jobs with condor_submit from multiple submission
     >>>         hosts simultaneously. I have only a single schedd, which
     >>>         I'm trying to stress-test. These jobs are in the vanilla
     >>>         universe.
     >>>
     >>>         The problem is that I couldn't reach a better result than
     >>>         4-6 submissions/sec, which seems a little low. I can't see
     >>>         any real bottleneck on the machine, so I suspect it's
     >>>         because of some default value of a configuration option
     >>>         that throttles the submission requests.
     >>>
     >>>         Any idea how to solve this?
     >>>
     >>>         Thanks,
     >>>         Daniel
     >>>
     >>>
     >
     >
     > --
     > Todd Tannenbaum <tannenba@xxxxxxxxxxx>  University of Wisconsin-Madison
     > Center for High Throughput Computing    Department of Computer Sciences
     > HTCondor Technical Lead                 1210 W. Dayton St. Rm #4257
     > Phone: (608) 263-7132                   Madison, WI 53706-1685
     >





_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@cs.wisc.edu with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/