
Re: [HTCondor-users] Task scheduling with wall-clock-time of SLURM nodes



On 9/8/23 12:20, Seung-Jin Sul wrote:
Hi, 

I am using SLURM nodes to create pools of HTCondor workers, and I am running a separate service that watches `condor_q` and executes `sbatch` or `scancel` on demand.


Hi Seung:

This is a great approach.  Informally, we call this technique of running HTCondor execution point services as jobs under SLURM (or other batch systems) "glidein", or "gliding in to SLURM", and it is the basis of the OSG: https://osg-htc.org/

What I am trying to do is pass a runtime constraint for a task to HTCondor so that it schedules the task on a SLURM node that has enough wall-clock time left.
For example, if a task's estimated runtime is more than 1 hour, I want HTCondor to schedule it only on SLURM nodes that have more than 1 hour of lifetime left.


The first thing you want to do is to have the condor_startd advertise an absolute time of when it thinks it will go away.  Adding the following to the startd config file will do so:

AliveUntil = some_utc_time_in_seconds_when_this_ep_will_vanish
STARTD_ATTRS = AliveUntil


Obviously, your startup script will have to calculate the unix time to put into the "AliveUntil" line.
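
As a rough sketch of that calculation (not an official HTCondor or SLURM tool), the Python below turns a SLURM "time left" string into the two config lines. The function names and the safety margin are made up for illustration. It assumes your startup script obtains the remaining time itself, for example from `squeue -h -j "$SLURM_JOB_ID" -o %L`, and that writing `$(STARTD_ATTRS)` to append rather than overwrite suits your config.

```python
import time


def parse_slurm_time_left(text):
    """Convert a SLURM duration ('1-02:03:04', '02:03:04', '03:04') to seconds.

    Raises ValueError for non-numeric values such as 'UNLIMITED'.
    """
    text = text.strip()
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    fields = [int(f) for f in text.split(":")]
    if not 1 <= len(fields) <= 3:
        raise ValueError("unexpected SLURM duration: %r" % text)
    # Pad on the left so the fields are always [hours, minutes, seconds].
    hours, minutes, seconds = [0] * (3 - len(fields)) + fields
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def alive_until_config(seconds_left, now=None, safety_margin=60):
    """Return startd config lines advertising an absolute Unix expiry time.

    safety_margin (hypothetical default) is subtracted so the startd
    claims slightly less life than SLURM reports.
    """
    if now is None:
        now = int(time.time())
    alive_until = now + max(0, seconds_left - safety_margin)
    return (
        "AliveUntil = %d\n" % alive_until
        + "STARTD_ATTRS = $(STARTD_ATTRS) AliveUntil\n"
    )


if __name__ == "__main__":
    # Example: 2 hours left on the SLURM allocation.
    print(alive_until_config(parse_slurm_time_left("02:00:00")), end="")
```

The printed lines could be appended to a local config file before the condor_startd is started inside the SLURM job.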

Then, when the startd boots, it will advertise an AliveUntil custom ClassAd attribute which you can use for matchmaking in your jobs, e.g. a job submit file could look like:

Requirements = Target.AliveUntil > (time() + 3600)
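
If your service generates submit files per task, a small helper along these lines could fill in the runtime for you. This is only a sketch: the function name, the `margin_s` padding, and the minimal set of submit commands are assumptions, not anything HTCondor prescribes.

```python
def submit_description(executable, est_runtime_s, margin_s=300):
    """Build a minimal submit description that only matches slots
    advertising enough remaining life for the estimated runtime.

    margin_s (hypothetical) pads the estimate to allow for startup
    and file-transfer time.
    """
    needed = est_runtime_s + margin_s
    lines = [
        "executable = %s" % executable,
        # Same expression as above, with the task's own runtime filled in.
        "Requirements = Target.AliveUntil > (time() + %d)" % needed,
        "queue",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # A task estimated at one hour, padded by the default five minutes.
    print(submit_description("/bin/sleep", 3600), end="")
```

Note that a slot which does not advertise AliveUntil makes the comparison evaluate to undefined, so such slots will not match jobs carrying this requirement.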


Let us know how this goes,

-greg


Has anyone done this? Any ideas would be appreciated.

Thank you!

Best regards, 
Seung
