
Re: [HTCondor-users] Fill the htcondor pool breadth first



Using SlotId in the NEGOTIATOR_PRE_JOB_RANK expression works for a pool with static slots. For a pool with partitionable/dynamic slots, you want to set CLAIM_PARTITIONABLE_LEFTOVERS and NEGOTIATOR_DEPTH_FIRST.
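
A minimal sketch of that combination, assuming the breadth-first goal described here (setting both knobs to False should stop jobs from being stacked onto one machine, since claiming pslot leftovers is what produces the depth-first behavior):

    # negotiator config: don't pack partitionable slots depth first
    NEGOTIATOR_DEPTH_FIRST = False

    # schedd config: send each job back through the negotiator instead of
    # reusing the leftover resources of an already-claimed partitionable slot
    CLAIM_PARTITIONABLE_LEFTOVERS = False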
I'm surprised that it makes a difference whether or not your jobs are in a single cluster or batch.

The CLAIM_PARTITIONABLE_LEFTOVERS setting is used by your schedd and affects all jobs going to all pools. This makes it difficult to get breadth-first scheduling for one pool and depth-first for another pool with the same schedd.
For I/O intensive jobs, you can create a custom resource for I/O that those jobs would request. Then the scheduler will enforce that only a certain number of I/O jobs can run at a time on each machine.
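
A minimal sketch of the custom-resource approach, using an illustrative resource name "IO" and made-up counts:

    # execute-node config: advertise 4 units of a custom machine resource
    MACHINE_RESOURCE_IO = 4

Then each I/O-intensive job requests a unit of it in its submit description, so at most four such jobs can match a given machine at once:

    # submit description: consume one IO unit per job
    request_IO = 1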

 - Jaime

On Feb 28, 2024, at 6:57 AM, Vikrant Aggarwal <ervikrant06@xxxxxxxxx> wrote:

Hello Experts,

This issue is easily reproducible: if we submit a single job per batch, all the batches go to the same node. If we submit multiple jobs in a batch, they use different worker nodes.

Changing RANK or changing the following expressions doesn't help.

# condor_config_val NEGOTIATOR_PRE_JOB_RANK
(10000000 * My.Rank) + (1000000 * (RemoteOwner =?= UNDEFINED)) - (100000 * Cpus) - Memory

# condor_config_val NEGOTIATOR_POST_JOB_RANK
(RemoteOwner =?= UNDEFINED) * (ifthenElse(isUndefined(KFlops), 1000, Kflops) - SlotID - 1.0e10*(Offline=?=True))



Thanks & Regards,
Vikrant Aggarwal


On Tue, Feb 27, 2024 at 11:47 AM Vikrant Aggarwal <ervikrant06@xxxxxxxxx> wrote:
Hello Experts,

In an htcondor pool with dynamic slots, I followed article [1] to fill the pool breadth first, but the schedd is still running the jobs on a single worker machine (for example, if a batch of 10 jobs is submitted, all 10 jobs land on one worker node instead of spreading across the 5-6 available worker nodes). These jobs are I/O intensive on local disk, hence we want to distribute them across worker nodes. Is there anything else I need to do to make this work reliably?

Just in case it matters, this schedd is also used to flock jobs to another pool that we want to fill depth first; however, the job requirements name the pool (the primary master pool) that has the breadth-first configuration, so I'm not sure whether the flocking part is relevant or not.