Re: [HTCondor-users] Job Runtime Problem
- Date: Tue, 16 Jul 2013 14:00:34 -0400
- From: Matthew Farrellee <matt@xxxxxxxxxx>
- Subject: Re: [HTCondor-users] Job Runtime Problem
You should take the user logs from your runs and send them through
It'll give you an idea of when jobs started, where, how long they took, etc.
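If you want a quick look before using a dedicated tool, here is a minimal Python sketch of the same idea. It assumes the classic plain-text user log layout, where each event starts with a three-digit code (001 for "Job executing on host", 005 for "Job terminated") followed by the job id and a timestamp. The SAMPLE_LOG contents, host addresses, and the function names summarize and runtimes are made up for illustration; check your own log against the pattern before relying on it.

```python
import re
from datetime import datetime

# Hypothetical sample in the classic user-log layout (not taken from a real run).
# Real logs carry extra detail lines under each event; those are skipped here.
SAMPLE_LOG = """\
000 (101.000.000) 07/16 13:00:00 Job submitted from host: <10.0.0.1:9618>
...
001 (101.000.000) 07/16 13:00:05 Job executing on host: <10.0.0.11:9618>
...
001 (101.001.000) 07/16 13:00:05 Job executing on host: <10.0.0.12:9618>
...
005 (101.000.000) 07/16 13:06:17 Job terminated.
...
005 (101.001.000) 07/16 13:18:33 Job terminated.
...
"""

# Event header: code, (cluster.proc.subproc), MM/DD HH:MM:SS, free text.
EVENT = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.\d+\) (\d\d/\d\d \d\d:\d\d:\d\d) (.*)$"
)


def summarize(log_text, year=2013):
    """Collect start time, end time, and execute host per job id."""
    jobs = {}
    for line in log_text.splitlines():
        m = EVENT.match(line)
        if m is None:
            continue  # detail line or "..." separator
        code, cluster, proc, stamp, rest = m.groups()
        # The classic format has no year, so one is supplied by the caller.
        when = datetime.strptime(f"{year}/{stamp}", "%Y/%m/%d %H:%M:%S")
        job = jobs.setdefault(f"{int(cluster)}.{int(proc)}", {})
        if code == "001":  # Job executing on host
            host = re.search(r"<([^>]+)>", rest)
            job["start"] = when
            job["host"] = host.group(1) if host else None
        elif code == "005":  # Job terminated
            job["end"] = when
    return jobs


def runtimes(jobs):
    """Seconds between execute and terminate for every finished job."""
    return {
        job_id: (info["end"] - info["start"]).total_seconds()
        for job_id, info in jobs.items()
        if "start" in info and "end" in info
    }


if __name__ == "__main__":
    jobs = summarize(SAMPLE_LOG)
    for job_id, seconds in sorted(runtimes(jobs).items()):
        print(job_id, jobs[job_id]["host"], f"{seconds:.0f}s")
```

Grouping the resulting runtimes by host is a cheap way to see whether the slow jobs all land on the same machines.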
On 07/16/2013 01:57 PM, Vishal Shah wrote:
The goal is to run N jobs to compute an algorithm on N chunks of data
simultaneously. Currently there are 22 chunks of data and as a result 22
jobs are submitted through the submit script. The queue has no load and
when the 22 jobs are submitted, all of them begin at the same time. The
times that I provided below are the run times for the algorithm.
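To make the comparison concrete, here is a small Python sketch that subtracts CPU time (user + sys) from wall-clock time (real) for the four timings quoted below. The helper names to_seconds and off_cpu are only illustrative, and the reading of the result assumes the algorithm is single-threaded, so that real minus (user + sys) approximates time spent waiting rather than computing.

```python
import re


def to_seconds(stamp):
    """Convert a `time`-style value such as '18m28.193s' to seconds."""
    m = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if m is None:
        raise ValueError(f"unrecognized time value: {stamp!r}")
    return int(m.group(1)) * 60 + float(m.group(2))


def off_cpu(real, user, sys):
    """Wall-clock seconds not accounted for by user + sys CPU time."""
    return to_seconds(real) - (to_seconds(user) + to_seconds(sys))


# (real, user, sys) copied from the timings in this thread.
RUNS = {
    "2 jobs, instance 1": ("7m13.950s", "5m36.766s", "0m14.436s"),
    "2 jobs, instance 2": ("6m2.555s", "5m35.747s", "0m13.170s"),
    "22 jobs, category 1": ("18m28.193s", "5m39.153s", "0m15.111s"),
    "22 jobs, category 2": ("6m12.578s", "5m36.433s", "0m12.644s"),
}

if __name__ == "__main__":
    for label, (real, user, sys) in RUNS.items():
        cpu = to_seconds(user) + to_seconds(sys)
        print(f"{label}: cpu={cpu:.1f}s waiting={off_cpu(real, user, sys):.1f}s")
```

CPU time stays within a few seconds of 350s in every case, while the slow category spends roughly 12.5 minutes off the CPU, so the extra time is spent waiting on something rather than doing more computation.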
> Date: Tue, 16 Jul 2013 13:48:49 -0400
> From: matt@xxxxxxxxxx
> To: htcondor-users@xxxxxxxxxxx
> CC: vishal.b.shah@xxxxxxxxxxx
> Subject: Re: [HTCondor-users] Job Runtime Problem
> On 07/16/2013 01:23 PM, Vishal Shah wrote:
> > Hello,
> > When submitting N instances of a job, generally N/2 jobs run in the
> > expected time and the other N/2 jobs take longer to complete. The
> > cluster has 10 nodes each with 32 slots and uses a shared filesystem
> > (GlusterFS). All of the executables and data files are located on the
> > shared file system; however, the problem does not seem to be an I/O or
> > network bottleneck.
> > When submitting 2 instances, the two times are the following:
> > Instance 1
> > real 7m13.950s
> > user 5m36.766s
> > sys  0m14.436s
> > Instance 2
> > real 6m2.555s
> > user 5m35.747s
> > sys  0m13.170s
> > When submitting 22 instances, the difference in times is more drastic.
> > The two categories that the times fall into are the following:
> > Category 1:
> > real 18m28.193s
> > user 5m39.153s
> > sys  0m15.111s
> > Category 2:
> > real 6m12.578s
> > user 5m36.433s
> > sys  0m12.644s
> > Does anybody have insight into this issue?
> > Thanks,
> > Vishal
> Share your goal, so we can tell what the issue may be.
> FYI, condor_submit <-> condor_schedd communication is very chatty and
> the condor_schedd is single threaded. The schedd may have ignored your
> submit for a period while doing some job maintenance, which resulted in
> an 18min runtime for submit.