
Re: [HTCondor-users] Choosing suitable universes for jobs



Tim, Dimitri, thanks a lot for the links. I think I found the answer - partitionable slots + condor_defrag - but I still have some follow-up questions.

Tim, I'll try to give you a specific example to describe the use case:

For example in my cluster I have a single partitionable slot with 10 CPUs.

Then I have 2 types of jobs - Job A requires 10 CPUs to run while Job B requires 1 CPU to run.
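To make the two job types concrete, they might be submitted with files along these lines (executable names are hypothetical; request_cpus is the standard condor_submit knob for per-job CPU requests):

```
# job_a.sub - hypothetical "big" job needing the whole 10-CPU slot
universe     = vanilla
executable   = big_task
request_cpus = 10
queue 1

# job_b.sub - hypothetical "small" single-CPU job, submitted in bulk
universe     = vanilla
executable   = small_task
request_cpus = 1
queue 5000
```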

I submit several thousand instances of Job B and one Job A. There is a potential starvation problem for Job A, because all 10 CPUs may never be free at the same time to run it, even if its priority is higher.

SGE solves this problem with "resource reservation", which is described here:
http://www.gridengine.info/2006/05/31/resource-reservation-prevents-parallel-job-starvation/

HTCondor offers the condor_defrag daemon, which periodically drains machines. What concerns me about it is the periodic nature of the draining and the unclear algorithm for choosing which machines to drain. If no big jobs are currently waiting for a multi-core slot, why should anything be drained at all? And what if the only machine suitable for a particular job never gets picked for draining?
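For reference, the draining behavior is tunable through a handful of condor_config knobs; a sketch of the kind of configuration involved (the specific expressions are illustrative assumptions, not a recommended policy):

```
# Run the defrag daemon alongside the other daemons
DAEMON_LIST = $(DAEMON_LIST) DEFRAG

# How often defrag wakes up and how aggressively it drains
DEFRAG_INTERVAL                  = 600
DEFRAG_DRAINING_MACHINES_PER_HOUR = 1.0
DEFRAG_MAX_CONCURRENT_DRAINING   = 2

# Stop draining once this many "whole" machines exist
DEFRAG_MAX_WHOLE_MACHINES = 2

# Which machines are candidates for draining (example expression)
DEFRAG_REQUIREMENTS = PartitionableSlot && TotalCpus >= 10

# Preference among candidates - e.g. drain the least-loaded first
DEFRAG_RANK = -ExpectedMachineGracefulDrainingBadput
```

DEFRAG_REQUIREMENTS and DEFRAG_RANK are the closest thing to controlling *which* machines get picked, which partially addresses the "never gets picked" worry, though defrag still does not look at the idle-job queue when deciding whether to drain.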

So the algorithm behind condor_defrag is not transparent to me, and it seems the daemon won't solve the starvation problem in some cases...

Please advise,

Thank you,
Dmitry



On Thu, Jun 13, 2013 at 9:48 AM, Dimitri Maziuk <dmaziuk@xxxxxxxxxxxxx> wrote:
On 06/13/2013 11:01 AM, Dmitry Grudzinskiy wrote:
> I asked a similar question yesterday (though probably wasn't clear enough)
> and received a possible solution (thank you), but I still feel that I don't
> understand something.
> Being new to condor I'm getting a little confused when picking suitable
> universes for our jobs.

You might want to read this:
https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=WholeMachineSlots

It describes how to set up some machines as "multi-cpu" slots and keep
"small jobs" off them. You could do something along those lines.
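As a rough sketch of that idea, one could define a partitionable slot on the designated machines and use a START expression to keep single-core jobs off them (the 8-CPU threshold here is an arbitrary example, not from the wiki page):

```
# On the machines reserved for big jobs:
# one partitionable slot spanning all CPUs
NUM_SLOTS              = 1
NUM_SLOTS_TYPE_1       = 1
SLOT_TYPE_1            = 100%
SLOT_TYPE_1_PARTITIONABLE = TRUE

# Only accept jobs that request a large chunk of the machine
START = (TARGET.RequestCpus >= 8)
```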

If your jobs are in a DAG, the default scheduling is breadth-first, so
if your small jobs are children of the big jobs, all big jobs are likely
to be scheduled to run first anyway. That only applies within a single
DAG, though.
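For illustration, a minimal DAG file expressing "small jobs are children of the big job" would look like this (submit-file names are hypothetical):

```
# example.dag - big job runs first; small jobs only after it completes
JOB  BigA    job_a.sub
JOB  SmallB  job_b.sub

PARENT BigA CHILD SmallB
```

Submitting it with condor_dagman (condor_submit_dag example.dag) guarantees the ordering within this one DAG, but as noted above it does nothing across independently submitted jobs.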

--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/