
Re: [HTCondor-users] slow submission rate



Hi!

So I finally eliminated the possibility of I/O bottlenecks by putting
/var/lib/condor and /var/log/condor onto tmpfs. I also set
transfer_executable = False and should_transfer_files = no (to rule out
a network bottleneck).
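
For anyone who wants to reproduce the test, here is a minimal sketch of a
submit description with the two settings above, generated from Python. The
helper name make_submit_description, the dummy executable /bin/true and the
job count are assumptions for illustration, not the exact file used in this
test. The "queue N" line follows the batching advice given earlier in the
thread: many jobs in one submit file means fewer transactions.

```python
def make_submit_description(executable="/bin/true", n_jobs=100):
    """Return the text of a vanilla-universe submit description.

    The executable and job count are illustrative defaults (assumptions).
    """
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        # No file transfer at all, to rule out a network bottleneck.
        "transfer_executable = False",
        "should_transfer_files = NO",
        # Queue n_jobs jobs from a single submit file.
        f"queue {n_jobs}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the description; it could be saved to a file and
    # passed to condor_submit.
    print(make_submit_description(), end="")
```

Note that should_transfer_files = NO assumes the execute machines can see
the executable at the same path (for example via a shared filesystem).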

Now I suspect the bottleneck is the number of context switches:

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  0     10   1325    102    217    0    0     1    25    9   11  3  2 93  2  0
 0  0     10   1324    102    217    0    0     0     8 1945 3087 10  8 79  3  0
 0  0     10   1323    102    217    0    0     0     0 1964 3105 13  8 80  0  0
 0  0     10   1322    102    217    0    0     0     0 2267 3608 12  9 79  0  0
 0  0     10   1321    102    217    0    0     0    34 1502 2395  8  6 86  0  0
 0  0     10   1320    102    217    0    0     0     0 1969 3088 13  8 79  0  1
 0  0     10   1319    102    217    0    0     0    84 2291 3654 12  9 76  4  0
 1  0     10   1318    102    217    0    0     0     0 2083 3089 23 10 67  0  0
 0  0     10   1317    102    218    0    0     0     0 2070 3303 10  9 81  0  1
 0  0     10   1316    102    218    0    0     0    54 1257 1994  6  5 88  2  0
 0  0     10   1315    102    218    0    0     0     0 1975 3146 12  8 80  0  0
 0  0     10   1314    102    218    0    0     0     0 2375 3810 12 10 79  0  0
 0  0     10   1313    102    218    0    0     0     0 2017 3158 13  8 78  0  1

A peak of about 3800 context switches/sec seems a little too high. Any
idea how I can tune condor or Linux to reduce this?
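
In case it helps with comparing runs, below is a rough sketch of how the
"cs" column of the vmstat output above can be summarized. The function
names are made up for illustration. The first vmstat row is skipped because
it reports averages since boot rather than a live sample.

```python
VMSTAT_SAMPLE = """\
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  0     10   1325    102    217    0    0     1    25    9   11  3  2 93  2  0
 0  0     10   1324    102    217    0    0     0     8 1945 3087 10  8 79  3  0
 0  0     10   1323    102    217    0    0     0     0 1964 3105 13  8 80  0  0
 0  0     10   1322    102    217    0    0     0     0 2267 3608 12  9 79  0  0
 0  0     10   1321    102    217    0    0     0    34 1502 2395  8  6 86  0  0
 0  0     10   1320    102    217    0    0     0     0 1969 3088 13  8 79  0  1
 0  0     10   1319    102    217    0    0     0    84 2291 3654 12  9 76  4  0
 1  0     10   1318    102    217    0    0     0     0 2083 3089 23 10 67  0  0
 0  0     10   1317    102    218    0    0     0     0 2070 3303 10  9 81  0  1
 0  0     10   1316    102    218    0    0     0    54 1257 1994  6  5 88  2  0
 0  0     10   1315    102    218    0    0     0     0 1975 3146 12  8 80  0  0
 0  0     10   1314    102    218    0    0     0     0 2375 3810 12 10 79  0  0
 0  0     10   1313    102    218    0    0     0     0 2017 3158 13  8 78  0  1
"""


def column(text, name):
    """Extract one named column from vmstat output as a list of ints."""
    rows = [line.split() for line in text.strip().splitlines()]
    header, data = rows[0], rows[1:]
    idx = header.index(name)
    return [int(row[idx]) for row in data]


def summarize(values, skip_first=True):
    """Return min, max and mean, optionally dropping the since-boot row."""
    vals = values[1:] if skip_first else values
    return {"min": min(vals), "max": max(vals), "mean": sum(vals) / len(vals)}


if __name__ == "__main__":
    stats = summarize(column(VMSTAT_SAMPLE, "cs"))
    print("context switches/sec:", stats)
```

The same helper works for the "in" (interrupts) column by passing "in"
instead of "cs".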

Thanks,
Daniel

2013/8/2 Todd Tannenbaum <tannenba@xxxxxxxxxxx>:
> On 8/2/2013 9:09 AM, Dan Bradley wrote:
>>
>>
>> Be aware that turning off fsync in the condor_schedd can lead to loss of
>> job state in the event of power loss or other sudden death of the
>> schedd.  This could result in jobs that were submitted shortly before
>> the outage disappearing from the queue without being run.  It could also
>> result in jobs being run twice.
>>
>> If that is acceptable for your purposes, then your problem is solved. If
>> it is not acceptable, then focus on improving the performance of the
>> filesystem containing $(SPOOL).
>>
>
> FWIW, on our busy submit nodes (dozens of users with typically thousands of
> running jobs), we put $(SPOOL) on a solid-state drive (SSD). Specifically,
> we mount the SSD on /ssd and then put in condor_config:
>    JOB_QUEUE_LOG = /ssd/condor_spool/job_queue.log
> The above allows us to put job_queue.log onto the SSD. This is the
> schedd's job queue, the file that gets a lot of fsyncs on transaction
> boundaries.  By using JOB_QUEUE_LOG, we can use a small/cheap SSD that does
> not have to be large enough to hold the entire contents of the $(SPOOL)
> directory.
>
> Performance is greatly improved and the risks Dan outlines above are
> avoided.
>
> -Todd
>
>
>> --Dan
>>
>> On 8/2/13 8:14 AM, Pek Daniel wrote:
>>>
>>> Thanks, the FSYNC trick solved the issue! :)
>>>
>>>
>>> 2013/8/1 Dan Bradley <dan@xxxxxxxxxxxx>
>>>
>>>
>>>
>>>     Are you timing just condor_submit, or are you also timing job
>>>     run/completion rates?
>>>
>>>     Job submissions cause the schedd to commit a transaction to
>>>     $(SPOOL)/job_queue.log.  If the disk containing that is slow,
>>>     submissions will be slow.  One way to verify if this is the
>>>     limiting factor is to add the following to your configuration:
>>>
>>>     CONDOR_FSYNC = FALSE
>>>
>>>     Another thing to keep in mind is that if you can batch submissions
>>>     of many jobs into a single submit file, there will be fewer
>>>     transactions.
>>>
>>>     --Dan
>>>
>>>
>>>     On 8/1/13 10:17 AM, Pek Daniel wrote:
>>>
>>>         Hi!
>>>
>>>         I'm experimenting with condor: I'm trying to submit a lot of
>>>         dummy jobs with condor_submit from multiple submission hosts
>>>         simultaneously. I have only a single schedd, and I'm trying to
>>>         stress-test it. These jobs are in the vanilla universe.
>>>
>>>         The problem is that I couldn't get better than 4-6
>>>         submissions/sec, which seems a little low. I can't see any real
>>>         bottleneck on the machine, so I suspect that the default value
>>>         of some configuration option is throttling the submission
>>>         requests.
>>>
>>>         Any idea how to solve this?
>>>
>>>         Thanks,
>>>         Daniel
>>>
>>>
>>>     _______________________________________________
>>>     HTCondor-users mailing list
>>>     To unsubscribe, send a message to
>>>     htcondor-users-request@xxxxxxxxxxx with a
>>>     subject: Unsubscribe
>>>     You can also unsubscribe by visiting
>>>     https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>>>
>>>     The archives can be found at:
>>>     https://lists.cs.wisc.edu/archive/htcondor-users/
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>>
>>
>
>
> --
> Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
> Center for High Throughput Computing   Department of Computer Sciences
> HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
> Phone: (608) 263-7132                  Madison, WI 53706-1685
>