
Re: [HTCondor-users] [External] Re: Force all jobs by a user on same execute node



Another approach you could take would be to choose an available target machine via an include command pipe in the submit description, set a requirement binding TARGET.Machine to that machine's name, and then submit the various jobs with multiple queue statements, all sharing that requirement.

So if your multi-process job as a whole needs 10 cores and 40g of memory, you'd have something like this:

Include command : " /usr/bin/condor_status -limit 1 -constraint 'SlotTypeID==1 && State is ""Unclaimed"" && Cpus>=10 && Memory>=40960' -format 'TargetMachine = %s\n' Machine "
If ! defined TargetMachine
        Error: no currently available machine meets job criteria
Endif
Requirements = ( TARGET.Machine is "$(TargetMachine)" )

Executable = /bin/sleep
Request_memory = 10g
Request_cpus = 3
Arguments = 5m
Queue

Request_memory = 8g
Request_cpus = 2
Arguments = 10m
Queue

Request_memory = 22g
Request_cpus = 5
Arguments = 15m
Queue

Basically, that include command line runs a condor_status query on your cluster to look for a machine which is partitionable and unclaimed and can meet the jobs' total requirements, and causes the submit to set the TargetMachine macro to that machine's name. Each job is then spawned with different arguments and request values, with the requirements expression demanding that only that single machine be matched to all of the jobs.
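Run by itself, the query's -format output is exactly one submit-file line, which is what gets ingested by the include. It would look something like this (the machine name here is just a made-up example):

TargetMachine = exec-node-07.example.org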

This has its limitations: it's possible that the machine you identify at submit time will be matched to a different job first, and thus wind up not qualified to run these jobs, leaving some of the components idle until the other jobs wrap up and leave the machine. But depending on how busy your schedd is and what the job requirements are, this may not be a concern.

It might also be necessary to set the condor_status memory value to somewhat more than the sum of all the jobs' requests, to deal with possible quantization of the values in the negotiation process; if your machines generally have a lot more memory than required, though, this is also a small concern.
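For example, padding the memory constraint by roughly 10% over the 40960 MB sum (the padding figure is arbitrary, not a Condor requirement) would change the query to something like:

Include command : " /usr/bin/condor_status -limit 1 -constraint 'SlotTypeID==1 && State is ""Unclaimed"" && Cpus>=10 && Memory>=45056' -format 'TargetMachine = %s\n' Machine "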

What I like about this approach is that it eliminates the need for any scripting or whatnot outside of the submit description file.

Michael Pelletier
Principal Technologist
High Performance Computing
Infrastructure & Workplace Services

C: +1 339.293.9149
michael.v.pelletier@xxxxxxx

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Carsten Aulbert
Sent: Friday, January 5, 2024 9:40 AM
To: gagan tiwari <gagan.tiwari@xxxxxxxxxxxxxxxxxx>
Cc: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [External] Re: [HTCondor-users] Force all jobs by a user on same execute node

Hi

On 1/5/24 15:17, gagan tiwari wrote:
>                             Yes these jobs will use shared memory on
> the node and interact with each other. Which is why it's needed to run
> all of them on the same execute node.

ok.

Then let's assume you have three types of jobs: 1 of type A, 8 of type B and 1 of type C (I'm just making up numbers here).

A and B require 1 GByte of RAM and 1 core each, and jobs of type C require 10 GByte of RAM and 4 cores.

Then, I'd suggest the following:

(1) Create a wrapper shell script that starts all the jobs, e.g.
the following snippet called wrapper_script.sh (it needs to be executable!):

--8><----8><----8><----8><----8><----8><----8><--
#!/bin/bash

# start all jobs and put them into the "background"
# just making up command line arguments here
A arg1 arg2 arg3 &
B 1 &
B 2 &
B 3 &
B 4 &
B 5 &
B 6 &
B 7 &
B 8 &
C &

# wait for all jobs to finish
wait
echo "done"
exit 0
--8><----8><----8><----8><----8><----8><----8><--

Then the submit file requests resources for the SUM of all jobs; e.g., it could look like this:

--8><----8><----8><----8><----8><----8><----8><--
request_cpus = 13
# some head room for RAM
request_memory = 20000

executable = wrapper_script.sh
Queue
--8><----8><----8><----8><----8><----8><----8><--

Obviously, that would be the bare minimum: there is no error handling, no handling of output, etc., but hopefully it's enough to get going.
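For the output side, the usual submit commands can be added to the same file; something along these lines (the file names are placeholders, not anything prescribed):

--8><----8><----8><----8><----8><----8><----8><--
request_cpus = 13
# some head room for RAM
request_memory = 20000

executable = wrapper_script.sh
output = wrapper.out
error = wrapper.err
log = wrapper.log
Queue
--8><----8><----8><----8><----8><----8><----8><--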

Condor will then try to find a machine which has enough resources available and start the wrapper script, which in turn will take care of the rest.
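For the error-handling side, one common pattern is to collect each background job's PID and wait on them individually, so a failing component is noticed. A sketch (the sleep commands are stand-ins for the real A, B and C invocations; a real wrapper would end with "exit $status"):

```shell
#!/bin/bash

# Stand-in commands; replace with the real A, B and C invocations.
pids=()
sleep 0.1 & pids+=($!)
sleep 0.2 & pids+=($!)
( exit 0 ) & pids+=($!)

# wait on each PID individually so a failure is noticed
status=0
for pid in "${pids[@]}"; do
    if ! wait "$pid"; then
        echo "job with PID $pid failed" >&2
        status=1
    fi
done
echo "done"
```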

Does that make sense?

Cheers

Carsten
--
Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics, Callinstraße 38, 30167 Hannover, Germany, Phone +49 511 762 17185