
Re: [Condor-users] About negotiation



Thanks Ian,

I should also mention that as of 6.7.17 there is an easier way to "auto-preempt" claims without disturbing jobs. Example:

#Only allow new jobs to start on a claim for up to 20 minutes:
CLAIM_WORKLIFE = 1200

--Dan

Ian Chesal wrote:

I want to overcome some undesirable behavior of Condor, but I have failed
to find the right configuration entries. The problem is as follows:
When a job finishes on a machine and there is another job in the queue
from the same user that matches the same machine, Condor runs that
user's next job without negotiation. So it does not take user
priorities into account. There are no entries in NegotiatorLog about
starting the user's second job, and in SchedLog:
Starting add_shadow_birthdate(1408.0)
Started shadow for job 1408.0 on "<192.168.201.2:32773>", (shadow pid = 6681)

The desirable behavior after job completion is to loop over all users
and give the free resource to the one with the lowest effective priority.
Please, can you help me solve this problem?

This is part of the "high throughput" portion of Condor. A claim on a
startd remains in place until it: a) runs out of jobs to process from
the cluster; or b) gets preempted by another claim. As long as there's
no one with a lower user priority value in the system it's much more
efficient to keep cycling through the jobs from the current cluster
being executed than re-negotiate because you don't have to tear down and
set up the shadow again.

It sounds like you've disabled user priority preemption on your central
negotiator. Is that the case? What is PREEMPTION_REQUIREMENTS set to on
your central negotiator? Preemption is only considered if this
expression evaluates to true. If you want preemption to be based on user
priorities, make this expression compare the remote (running job) user
priority with the current (user being negotiated) user priority. See the
default condor_config file for an example.
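
A sketch of such a comparison, modeled on the stock condor_config of
this era. Treat the thresholds as assumptions and check the file shipped
with your version; $(StateTimer) and $(HOUR) are assumed to be the
macros defined in that default file:

	# Preempt only if the claim has been in its current state for over an
	# hour and the running user's priority value is more than 20% worse
	# (numerically higher) than that of the user being negotiated for.
	PREEMPTION_REQUIREMENTS = $(StateTimer) > (1 * $(HOUR)) && RemoteUserPrio > SubmittorPrio * 1.2

Here RemoteUserPrio refers to the user whose job currently holds the
machine and SubmittorPrio to the user being negotiated for. The time
guard and the 1.2 factor are there to keep users with nearly equal
priorities from preempting each other back and forth.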

You can "auto-preempt" jobs that have been running for longer than X
minutes if you really want to have a startd re-negotiated after every job
completes. We actually do this here at Altera and it works fairly well.
As long as your jobs are long (say 20 minutes or greater) the impact on
throughput is pretty minimal. I would also add a cautionary note that
we run ONLY vanilla universe jobs that do not checkpoint, so this scheme
works great. For checkpoint-able jobs or a mixed bag of jobs I don't
think you'll want to go this route.

To trigger "auto-preemption" you need to use the MAX_JOB_RETIREMENT_TIME
setting and the PREEMPT setting in your startd configuration file.

You set MAX_JOB_RETIREMENT_TIME to be something much longer than any job
you expect to run in your system (the value is in seconds; this one is
112 days):

	MAX_JOB_RETIREMENT_TIME = 9676800

And you set PREEMPT to be true after the job has been running for say 5
minutes:

	PREEMPT = ( $(ActivationTimer) > 300 )

This automatically sets the job to be "preempting" after it's run for 5
minutes. But the retirement timer lets the job finish normally. The
difference is that because the job was "preempted" and there's no
waiting schedd claim on the startd causing the preemption, the startd is
renegotiated like you want.

If you go with this approach I also suggest setting the schedd
preemption timeout to be small and possibly disallowing negotiator
preemptions:

	PREEMPTION_REQUIREMENTS = FALSE
	REQUEST_CLAIM_TIMEOUT = 120
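
For orientation, here is the whole scheme in one place, grouped by the
daemon that is understood to read each setting. This is only a sketch
using the values above, and $(ActivationTimer) is assumed to be the
macro from the default condor_config (time since the job started):

	# startd (execute machines): mark the claim as preempting after
	# 5 minutes, but let the running job retire normally.
	MAX_JOB_RETIREMENT_TIME = 9676800
	PREEMPT = ( $(ActivationTimer) > 300 )

	# negotiator (central manager): no priority-based preemption.
	PREEMPTION_REQUIREMENTS = FALSE

	# schedd (submit machines): give up on a claim request after 2 minutes.
	REQUEST_CLAIM_TIMEOUT = 120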

You can get more information on all these configuration settings from
the Condor documentation online.

- Ian


_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/condor-users