Re: [HTCondor-users] Scheduling interactive and Batch use of GPUs



Hi Chris,

Hope this finds you well!

Your questions below ask about various knobs and mechanisms in HTCondor, but IMHO the first step is to decide what scheduling policy you want. I suggest you momentarily forget about HTCondor knobs and instead simply tell us in plain English what you want to happen, without any references to HTCondor. Once you know what policy you want (this can be hard to do, especially if you need lots of agreement within an organization!), the next step is to "implement" it by configuring various HTCondor knobs. Usually the implementation step is easier than figuring out what you really want to do :).

So, with that said, I understand you want to mix interactive and batch jobs on the same server. Do you want to prioritize interactive jobs over batch jobs, or vice versa? Should interactive jobs be removed if they cannot be started within X minutes? Can batch jobs be preempted (i.e. killed, and then restarted over again later)?

Assuming you don't want interactive jobs starting in the middle of the night, mixing interactive and batch work typically requires a fundamental decision: either 1) allow preemption of batch jobs, or 2) if preemption cannot be tolerated, reserve some percentage of resources exclusively for interactive use.

For instance, if preemption is not allowed, maybe you want a policy like "1 out of 4 GPU devices will be reserved to only run interactive jobs", or something like "GPU devices will be reserved for interactive jobs between 9am and 9pm, and batch jobs will only be allowed to run between 9pm and 9am". Or if you can tolerate preemption, you can increase utilization with a policy like "all 4 GPU devices prefer to run interactive jobs, but to maximize utilization, batch jobs may start whenever no interactive jobs are waiting; if an idle interactive job has waited more than 20 minutes to start, a batch job may be preempted to make room for it".
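To make that concrete, below is a rough startd config sketch of both options. It is not a drop-in recipe: it assumes a single 4-GPU execute node, and it assumes interactive jobs are tagged by users (or a submit wrapper) with a custom job attribute, e.g. "+IsInteractive = True" in the submit file; that attribute name is just a convention made up for illustration, not something built into HTCondor.

  # Sketch only -- illustrative values for a 4-GPU node
  use feature : GPUs

  # --- Option 1: no preemption, reserve 1 of the 4 GPUs for interactive ---
  SLOT_TYPE_1               = GPUs=3, cpus=75%, memory=75%, disk=75%
  SLOT_TYPE_1_PARTITIONABLE = True
  NUM_SLOTS_TYPE_1          = 1
  SLOT_TYPE_2               = GPUs=1, cpus=25%, memory=25%, disk=25%
  SLOT_TYPE_2_PARTITIONABLE = True
  NUM_SLOTS_TYPE_2          = 1
  # Only jobs carrying the (made-up) IsInteractive attribute may use slot type 2
  START = ( SlotTypeID != 2 ) || ( TARGET.IsInteractive =?= True )

  # --- Option 2: allow preemption, prefer interactive jobs everywhere ---
  # A higher startd RANK match can push a lower-ranked running job off the
  # machine, so interactive jobs would preempt batch jobs:
  #RANK = ( TARGET.IsInteractive =?= True ) * 100
  # Give a preempted batch job some time to exit gracefully (in seconds):
  #MAXJOBRETIREMENTTIME = 20 * 60

Interactive submits would then include "request_GPUs = 1" and "+IsInteractive = True" so the expressions above can see them.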

Every policy has pros and cons, and you cannot make everyone happy all the time (unless you have so many resources available that there is no contention!), so the trick is to understand your users and their typical job workload and let that guide your decisions.

Hope the above helps,
regards,
Todd

On 1/23/2018 6:26 AM, chris.brew@xxxxxxxxxx wrote:
Hi,

We've just been given some money to buy a nice shiny GPU test box.

I would like to make the resources available to local users for interactive use (it's a test box after all) but also for local and grid batch use (we want to test this too).

I know condor can manage the scheduling of the GPUs with the "use feature : GPUs" knob; I was wondering about how best to integrate the local interactive users.
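For reference, I mean enabling the metaknob in the execute node's configuration, roughly:

  use feature : GPUs

with jobs then asking for a device via "request_GPUs = 1" in their submit files.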

My initial thought is to get local users to submit interactive jobs. That should be fine as long as the resources are not too heavily loaded, but if (when) the system gets more loaded we may end up with some dead time if an interactive job does not get scheduled until the middle of the night or over the weekend.

Now maybe that's the sign to ask for more money to expand the resource, but in lieu of that I was looking at either "Job Deferral" or "Computing on Demand".

If a user submitted a deferred job on Friday evening, would the job block the resource over the weekend, or would it not attempt to match until its deferral time came up? And I assume I can use whether the job is interactive in the startd RANK expression to heavily prioritise the interactive jobs.
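To illustrate what I mean by a deferred job, a submit file along these lines (the executable name and times are just for illustration):

  executable      = gpu_test.sh
  request_GPUs    = 1
  # run no earlier than this Unix epoch time, e.g. Monday 9am
  deferral_time   = 1517216400
  # seconds of slack allowed after deferral_time
  deferral_window = 3600
  queue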

Or would the "Computing on Demand" feature work with GPUs? Is it even possible to suspend a GPU job and use the GPU for another job?

Is there another way to achieve this that I haven't thought of?

Many Thanks,
Chris.

--
Dr Chris Brew
Scientific Computing Manager
Particle Physics Department
STFC - Rutherford Appleton Laboratory
Harwell Oxford,
Didcot
OX11 0QX
+44 1235 446326





--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685