Re: [Condor-users] JOBS IDLE IN QUEUE



On 8/10/05, Kresho, John (US SSA) <john.kresho@xxxxxxxxxxxxxx> wrote:
> I am currently running a Windows Condor Master and Windows compute nodes.  I also have Linux nodes, but they only submit jobs to the Windows Master.  Linux nodes do not execute any jobs.
> 
> When the Linux nodes submit jobs, it seems that the jobs stay in the queue for at least 30 seconds before running.   This happens when there is no load on the system, and only one job is submitted.
> 
> Is this normal, or is there a setting that can be changed to reduce this time in the queue?

This is normal and to be expected.

Condor is optimized for high *throughput*, not low latency. It
therefore only goes through the matchmaking process as and when
required, or at fixed intervals (just to check).
Submitting a job will trigger a negotiation cycle (matching jobs to
machines), though releasing one won't.
The negotiation cycle can be made more frequent than the default (5
minutes, I believe) by setting, say for 60 seconds:

NEGOTIATOR_INTERVAL = 60

Set this in the config of the machine running the negotiator, but note
that there is a *hardcoded limit* of 15 seconds between any two
negotiation cycles, no matter how they were initiated.
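
To make the timing rules above concrete, here is a toy model in Python.
It is only a sketch and not Condor code: the function name and its
parameters are made up for illustration, and it assumes the 15 second
gap is measured from the previous cycle and ignores how long a cycle
itself takes and how long it takes to start the job after a match.

```python
def seconds_until_cycle(since_last, interval=60.0, min_gap=15.0,
                        triggers_cycle=True):
    """Estimate the wait (in seconds) before the next negotiation cycle.

    since_last     -- seconds since the previous cycle (assumption: the
                      gap is counted from that cycle)
    interval       -- the periodic interval, i.e. NEGOTIATOR_INTERVAL
    min_gap        -- the hardcoded minimum gap between two cycles
    triggers_cycle -- True for an event that requests a cycle (a submit),
                      False for one that does not (a release)
    """
    if triggers_cycle:
        # A cycle is requested right away, but it cannot start sooner
        # than min_gap seconds after the previous one.
        return max(0.0, min_gap - since_last)
    # No cycle is requested, so the job waits for the periodic one.
    return max(0.0, interval - since_last)


if __name__ == "__main__":
    print(seconds_until_cycle(5.0))                        # 10.0
    print(seconds_until_cycle(40.0))                       # 0.0
    print(seconds_until_cycle(5.0, triggers_cycle=False))  # 55.0
```

In this model a submitted job never waits more than 15 seconds for a
cycle to start, whereas a released job can wait for almost the whole
NEGOTIATOR_INTERVAL.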

Note that the time a negotiation cycle takes is roughly proportional
to the number of queues with jobs, the number of jobs in those queues*,
and the number of machines available to run jobs.

30 seconds isn't so bad. I regularly have to explain this to my users,
and it is a tough concept to get across to people. Try to educate your
users into submitting jobs with run times long enough that an initial
lag of a few seconds or minutes is insignificant versus the throughput
advantage gained.

If you *really* need low latency for many little jobs (note that
Condor is faster once you have claimed a machine, since you can keep
the claim and send more jobs to it if there is no one with a higher
priority than you), then there is a third-party add-on by the Technion
guys:

http://www.cs.technion.ac.il/Labs/dsl/completed_projects/condor-llic/llic_web_site.htm

I have a feeling this is NOT the route you want to go down just yet...

* This can very simply be made proportional to the number of clusters
instead, provided all jobs in a cluster have the same requirements:
negotiation for a cluster then stops at the first job that fails to
match any available machine.
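
The footnote's idea can be sketched in Python as follows. This is an
illustration only, not how the negotiator is actually implemented: the
job and machine dictionaries, the `min_memory` requirement and the
`negotiate` function are all hypothetical, and it assumes every job in
a cluster has the same requirements.

```python
def negotiate(jobs, machines, stop_cluster_on_failure=True):
    """Match jobs to machines; return (matches, match_attempts)."""
    free = list(machines)      # machines not yet handed to a job
    matches = {}               # job id -> machine name
    attempts = 0               # job/machine comparisons performed
    failed_clusters = set()    # clusters already known not to match

    for job in jobs:
        if stop_cluster_on_failure and job["cluster"] in failed_clusters:
            # An identical job already failed, and the set of free
            # machines has only shrunk since, so this one fails too.
            continue
        chosen = None
        for index, machine in enumerate(free):
            attempts += 1
            if machine["memory"] >= job["min_memory"]:
                chosen = index
                break
        if chosen is None:
            failed_clusters.add(job["cluster"])
        else:
            matches[job["id"]] = free.pop(chosen)["name"]
    return matches, attempts


if __name__ == "__main__":
    machines = [{"name": "m1", "memory": 512},
                {"name": "m2", "memory": 512}]
    # Cluster 1: five identical jobs that no machine can satisfy.
    jobs = [{"id": "1.%d" % i, "cluster": 1, "min_memory": 1024}
            for i in range(5)]
    # Cluster 2: a single small job.
    jobs.append({"id": "2.0", "cluster": 2, "min_memory": 256})

    print(negotiate(jobs, machines, stop_cluster_on_failure=True))
    print(negotiate(jobs, machines, stop_cluster_on_failure=False))
```

Both calls produce the same match (job 2.0 on m1), but the first needs
3 comparisons and the second 11, because the four remaining jobs of
cluster 1 are skipped once the first one has failed.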