
Re: [HTCondor-users] Usage of JobStart or ActivationTimer in startd RANK expression?

That does get complicated. I think all of it is possible, but weaving everything together would be tricky. I can't give you a recipe here. Maybe others can chime in.
I'll give my immediate thoughts inline...


> On Apr 2, 2020, at 4:32 AM, Carsten Aulbert <carsten.aulbert@xxxxxxxxxx> wrote:
> Hi Jaime,
> On 4/1/20 9:32 PM, Jaime Frey wrote:
>> JobStart and $(ActivationTimer) are only valid once a job starts running (i.e. the slot is in the Busy activity).
>> Thus, they are not suitable for use in a RANK expression.
>> $(ActivityTimer) is always valid, and marks the time since the slot changed its Activity value (Busy, Idle, etc).
>> Have you looked at the MAXJOBRETIREMENTTIME parameter? This seems like an ideal use for it.
> We looked at it, but we possibly misunderstood it as we initially wanted
> to "solve" all our preemption only via the negotiator and thought of
> MAXJOBRETIREMENTTIME to be relevant for the startd only.

It sounds like MAXJOBRETIREMENTTIME (minimum run time is a better name) is almost good enough for all of your minimal runtime needs. It's an expression set on the startd that can reference job attributes. So each machine can have a different minimal runtime, and that runtime can be different for each job. The negotiator will not make preempting matches on the slot until the retirement time is about to expire.
The one drawback is that a job's retirement time is set when it starts executing, and can't vary depending on what type of job would preempt it. So your policy for CPU jobs on GPU machines can't be done this way.
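
As a rough, untested sketch of a per-job minimal runtime on one machine (the 5- and 10-hour values are placeholders, and keying on RequestGPUs is only my assumption about how you would tell GPU jobs from CPU-only jobs):

```
# startd configuration, set locally on each machine
# Job attributes are referenced with the TARGET. prefix. RequestGPUs is
# undefined for jobs that did not ask for a GPU, so test for that first.
MAXJOBRETIREMENTTIME = ifThenElse(TARGET.RequestGPUs =!= undefined && TARGET.RequestGPUs > 0, 10 * 3600, 5 * 3600)
```

The value is in seconds of wall-clock time. Because it is fixed when the job starts, it still cannot express "guaranteed against CPU jobs but not against GPU jobs".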

In your existing configuration for minimal runtime, you can use $(ActivationTimer) or $(ActivityTimer). Just make sure to also check (Activity == "Busy") and account for cases where the slot isn't in Busy activity.
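
Something along these lines, as an untested sketch (MinRuntimeMet is a made-up macro name and 5 hours is a placeholder):

```
# Hypothetical helper macro for use in your existing preemption expressions.
# If the slot is not Busy, this treats the minimal runtime as already met,
# on the assumption that there is no running job to protect.
MinRuntimeMet = ((Activity =!= "Busy") || ($(ActivityTimer) > 5 * 3600))
```

Note that this also treats a Suspended slot as preemptable, and $(ActivityTimer) restarts on every activity change, so decide how you want suspended jobs handled before using it.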

> If I may, here are the "rules" we try to implement and currently fail
> miserably - do you think those rules are possible to put into a 8.8 or
> 8.9 pool configuration? For us, we are currently unsure which of these
> rules should go into startd configuration and which into the negotiator...
> Over the years our pool went from only a single type of machine to a
> large collection of different hosts which we will only differentiate to
> be "CPU" hosts (ranging from 4 physical cores without HT/SMT to two
> socket machines with 64+64 logical cores) and "GPU" hosts (ranging from
> small nodes with ancient single GT640/GTX750 card set-ups to recently
> added systems with 8 state of the art GPUs).
> CPU hosts:
> * we guarantee a minimal runtime of each job (wall-clock time!), which
> should always be honored
> * actual time is locally configured, may differ between hosts and type
> of CPU-only hosts and we will try to adjust those to allow smooth
> operation for all users (e.g. 50% with 5 hours, 25% with 10 hours, 15%
> with 20 hours and the rest being unlimited).
> * users have to add their expected run time into the submit file for
> matching those groups - which in itself may already be hard due to
> various different CPUs available
> GPU host:
> * again, we guarantee a minimal runtime of GPU jobs (wall-clock time!),
> which should always be honored
> * as above, locally configured min run time
> * if CPU resources are available on a GPU host, a CPU job can be started
> there, but it does only have a guaranteed runtime against other CPU-only
> jobs
> * a GPU job is always allowed to preempt a CPU-only job regardless of
> runtime (obviously, if a GPU resource is available)
> To further complicate the matter:
> * interactive jobs:
>  * for testing purposes, users may submit interactive jobs (can the
> number and run time be limited?)
>  * these may request a number of GPU and CPU resources (obviously
> within the limits of available machines)
>  * due to their interactive nature, these jobs should be matched as
> quickly as possible, without breaking the guaranteed runtime rules above
>  * CPU-only jobs should only match against non-GPU hosts

You can use submit requirements and submit transforms to constrain the interactive jobs.
You can reject interactive jobs that request too much time, or silently cap the time requested.
You can also add attributes to tag the interactive jobs as special to make matchmaking easier.
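
A minimal, untested sketch of the first and last of those in the schedd configuration. ExpectedRuntime stands in for whatever attribute your users set for their expected run time, IsInteractiveTest is a made-up tag, and the 2-hour limit is a placeholder:

```
# Reject interactive jobs that do not state a run time of at most 2 hours.
SUBMIT_REQUIREMENT_NAMES = $(SUBMIT_REQUIREMENT_NAMES) InteractiveTimeLimit
SUBMIT_REQUIREMENT_InteractiveTimeLimit = (InteractiveJob =!= True) || (ExpectedRuntime =!= undefined && ExpectedRuntime <= 2 * 3600)
SUBMIT_REQUIREMENT_InteractiveTimeLimit_REASON = "Interactive jobs must request at most 2 hours."

# Tag interactive jobs so that other policy expressions can pick them out.
JOB_TRANSFORM_NAMES = $(JOB_TRANSFORM_NAMES) TagInteractive
JOB_TRANSFORM_TagInteractive = [ Requirements = (InteractiveJob =?= True); set_IsInteractiveTest = True; ]
```

InteractiveJob is the attribute that condor_submit -interactive puts in the job ad.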

On our local CHTC pool at UW-Madison, we run a second negotiator that only matches interactive jobs, whose negotiation cycle can be much faster than the main negotiator's. We use a submit transform to put interactive jobs in a separate accounting group and have the second negotiator only match jobs from that accounting group (a new feature coming in 8.9).

> * Overall preemption by negotiator should be governed by simple
> (default) rule of 20% better effective user prio.
> * dedicated scheduler/parallel universe:
>  * rarely needed by our users
>  * but if needed, high priority a.k.a. not a long waiting time,
> possibly by manually adjusting defrag daemon?
>  * how to integrate all this with 1-2 dedicated schedulers for subsets
> of machines?
>  * these jobs once matched should run until done, i.e. no preemption by
> other jobs

For parallel jobs, you use startd RANK to allow the dedicated scheduler to preempt any other jobs. You'll have to weave that in with other uses of startd RANK. You can decide if parallel jobs have to wait for preempted jobs' minimal runtime, but remember that other slots for the parallel job will be held idle until all required slots are available. The defrag daemon is not involved with dedicated scheduler jobs, other than possibly configuring it to leave machines running parallel jobs alone.
Parallel jobs and slots with multiple CPUs get complicated. How that would work depends on your slot configuration and your parallel jobs (1 core or full machine per job node).
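
For the RANK part, a sketch of how the weaving could look on the startd. The host name is a placeholder, and OtherRank is a made-up macro standing in for whatever else you rank on, assumed to stay well below 1000000:

```
# Advertise which dedicated scheduler this machine belongs to.
DedicatedScheduler = "DedicatedScheduler@submit.example.org"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler

# Hypothetical: your existing rank terms go here.
OtherRank = 0

# Jobs from the dedicated scheduler always outrank everything else.
RANK = 1000000 * (Scheduler =?= $(DedicatedScheduler)) + ($(OtherRank))
```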

> * defrag daemon (looks to be working more or less orthogonal to all
> these rules)
> Now the big question - is this too complex to implement or should this
> be possible (and if so, how?)
> Cheers and thanks a lot in advance!
> Carsten
> PS: Obviously, once we get this set-up up and running, we will document
> it as a complex example configuration if you want to re-use it!
> -- 
> Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
> Callinstraße 38, 30167 Hannover, Germany
> Phone: +49 511 762 17185