
RE: [Condor-users] preemption (again)


This is something that we've grappled with here at Altera as well. Our
tests with extending MaxJobRetirementTime to something ridiculously long
(like 2 weeks) have worked very well. We use the rank expression:

RANK = TARGET.JobPrio * 1000

And now the user-assignable job priority (the "priority=[-20:20]" tag in a
condor_submit file) determines preemption on the clients in a fairly
predictable manner. The one caveat with our long-running jobs is that
most jobs in the system can end up in the retiring state very quickly
after starting execution. That's not necessarily bad. It just requires
some additional explaining to users that retiring and running, from
their point of view, can be thought of as the same state.
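
For anyone wanting to try this, a minimal sketch of the setup described
above might look like the following. The retirement value is simply two
weeks written out in seconds, and the executable name and the priority
of 10 in the submit file are made-up placeholders, not our actual
settings.

# Machine (startd) configuration
# Prefer jobs with a higher user-assigned job priority.
RANK = TARGET.JobPrio * 1000
# Let a preempted job keep running for up to 2 weeks (in seconds)
# before it is actually evicted.
MaxJobRetirementTime = 3600 * 24 * 14

# Submit description file (hypothetical job)
executable = long_job
priority   = 10
queue

With settings along these lines, a job submitted with priority = 10
should rank above one submitted with priority = 0 on that machine
(10000 versus 0), and the lower-priority job would go into retirement
rather than being killed outright.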

As I understand the Condor system, when a job finishes (normally, not
preempted) the resource does not necessarily negotiate with the master
for a job from any jobs in the pool; rather, it may continue to take jobs
directly from the submitter if userprio so determines. Preemption
ensures the machine talks to the master for its next job. If your jobs
never get preempted it's possible for a submitter to hang on to a
resource for an abnormally long amount of time, maybe even indefinitely
(but I'm not sure about that). Using the MaxJobRetirementTime approach
ensures that your jobs pass through the preemption state but are given
sufficient time to finish normally, and then the resource reconnects to
the master to fetch its next job. Hence no one here is saying to
set PREEMPT = FALSE on your starters.

Bear in mind that I'm reasonably new to Condor, so maybe I've just had
it sorted out all wrong in my head for the last 2 months. Someone
correct me if I'm wrong, please.


> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx
> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of matthew hope
> Sent: October 6, 2004 4:47 AM
> To: Condor-Users Mail List
> Subject: [Condor-users] preemption (again)
>
> I am moving on to testing 6.7.1 and was rereading the
> manual (much better, by the way) for a refresher. I got to
> http://www.cs.wisc.edu/condor/manual/v6.7/7_3Running_Condor.ht

Reason number 3 is the owner (machine) preference: controlled by the
RANK expression in the configuration file (sometimes called the startd
rank or machine rank). The RANK expression is evaluated as a floating
point number. When one job is running, a second idle job that evaluates
to a higher RANK value tells the condor_startd to prefer the second job
over the first. Therefore, the condor_startd will evict the first job
so that it can start running the second (preferred) job. For more on
RANK, see section 3.6.

This implies that it is impossible to maintain a tiered ranking (or
indeed any other worker-controlled ranking) while also avoiding
preemption. Is this correct?

This is a real PITA. Can there not be another parameter which allows
control of this process (akin to PREEMPTION_REQUIREMENTS)?

I know I can ameliorate the problem with retirement settings but this is
very annoying.

It is reasonable to assume that a machine preferring a job does not
automatically mean it wants to preempt its current one.
