
Re: [HTCondor-users] Job Runtime Problem



Hey Matt,

Here are the results for 9 jobs (the tenth job did not run because of its dataset):

http://condorlog.cse.nd.edu/logs/fb/24/fb24c50065f54fc1c338f1dfab261caa/

Four jobs finished within one time interval, and the other five finished significantly later.

Thanks,
Vishal


From: vishal.b.shah@xxxxxxxxxxx
To: matt@xxxxxxxxxx; htcondor-users@xxxxxxxxxxx
Date: Tue, 16 Jul 2013 13:57:07 -0400
Subject: Re: [HTCondor-users] Job Runtime Problem

Hey Matt,

The goal is to run N jobs that apply an algorithm to N chunks of data simultaneously. There are currently 22 chunks of data, so 22 jobs are submitted through the submit script. The queue has no load, and when the 22 jobs are submitted, all of them begin at the same time. The times I provided below are the run times for the algorithm.

Thanks,
Vishal
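
For readers following along, here is a minimal sketch of what "N jobs on N chunks" can look like as an HTCondor submit description, generated from Python. The executable name, the output file names, and the idea of passing the chunk index as the only argument are assumptions for illustration; the actual submit script is not shown in this thread.

```python
def build_submit_description(executable: str, n_chunks: int) -> str:
    """Return submit-description text that queues one job per data chunk.

    $(Process) is expanded by HTCondor to 0..n_chunks-1, so each job
    receives its own chunk index as its single argument (an assumption
    about how the job selects its chunk).
    """
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        "arguments = $(Process)",
        "output = chunk_$(Process).out",   # hypothetical file names
        "error = chunk_$(Process).err",
        "log = chunks.log",
        f"queue {n_chunks}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # "run_algorithm.sh" is a placeholder name, not the real executable.
    print(build_submit_description("run_algorithm.sh", 22), end="")
```

Saving the printed text to a file and passing that file to condor_submit would queue all 22 jobs in one cluster.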

> Date: Tue, 16 Jul 2013 13:48:49 -0400
> From: matt@xxxxxxxxxx
> To: htcondor-users@xxxxxxxxxxx
> CC: vishal.b.shah@xxxxxxxxxxx
> Subject: Re: [HTCondor-users] Job Runtime Problem
>
> On 07/16/2013 01:23 PM, Vishal Shah wrote:
> > Hello,
> >
> > When submitting N instances of a job, generally N/2 jobs run in the
> > expected time and the other N/2 jobs take longer to complete. The system
> > has 10 nodes each with 32 slots and uses a shared filesystem
> > (GlusterFS). All of the executables and data files are located on the
> > shared file system; however, the problem does not seem to be an I/O or
> > network bottleneck.
> >
> > When submitting 2 instances, the two times are the following:
> >
> > Instance 1
> > real    7m13.950s
> > user    5m36.766s
> > sys     0m14.436s
> >
> > Instance 2
> > real    6m2.555s
> > user    5m35.747s
> > sys     0m13.170s
> >
> > When submitting 22 instances, the difference in times is more drastic.
> > The two categories that the times fall into are the following:
> >
> > Category 1:
> > real    18m28.193s
> > user    5m39.153s
> > sys     0m15.111s
> >
> > Category 2:
> > real    6m12.578s
> > user    5m36.433s
> > sys     0m12.644s
> >
> > Does anybody have insight into this issue?
> >
> > Thanks,
> > Vishal
>
> Share your goal, so we can tell what the issue may be.
>
> FYI, condor_submit <-> condor_schedd communication is very chatty and
> the condor_schedd is single threaded. The schedd may have ignored your
> submit for a period while doing some job maintenance, which resulted in
> an 18-minute runtime for submit.
>
> Best,
>
>
> matt
>
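
One way to read the timings quoted above is to compare wall-clock time with CPU time. The sketch below parses the `time`-style durations from the original message and computes real minus (user + sys) for the two 22-job categories. It assumes each job is a single-threaded process, in which case that difference is roughly the time the job spent off the CPU (waiting on something). It does not say what the job was waiting on.

```python
import re


def to_seconds(stamp: str) -> float:
    """Convert a `time`-style duration such as '18m28.193s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognized duration: {stamp!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def off_cpu_seconds(real: str, user: str, sys: str) -> float:
    """Wall-clock time minus CPU time (user + sys), in seconds.

    Only meaningful as "time spent waiting" for a single-threaded
    process, which is an assumption about these jobs.
    """
    return to_seconds(real) - (to_seconds(user) + to_seconds(sys))


# Values copied from the 22-instance run in the original message.
timings = {
    "category 1": ("18m28.193s", "5m39.153s", "0m15.111s"),
    "category 2": ("6m12.578s", "5m36.433s", "0m12.644s"),
}

if __name__ == "__main__":
    for label, (real, user, sys_) in timings.items():
        waited = off_cpu_seconds(real, user, sys_)
        print(f"{label}: {waited:.1f} s of wall-clock time not on CPU")
```

Both categories use nearly the same CPU time, so the gap between them is almost entirely wall-clock time in which the slow jobs were not computing.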
