
[HTCondor-users] Cluster configuration for mixed usecases

Dear HTCondor experts,

I'm wondering what a good setup for mixed use cases would look like. 

We use Singularity containers, and since checkpointing for these is sadly still unsupported (it should work fine with an OCI-based runtime such as runc, but HTCondor does not support that),
we usually have to live with a setup where evicting a job means losing the processing time invested in it. 

Our users submit jobs which may be anything from: 
- Very short jobs < 1 hour. 
- Short jobs of ~1-2 hours. 
- Medium jobs of 1-2 days. 
- Long jobs running 1 week. 
- In rare cases, jobs with 3 weeks of runtime. 
- Interactive jobs (for which we have reserved special partitionable slots, so I'll exclude them from the following discussion). 

Medium and short jobs are the most common, so in general our old PBS-based cluster's medium queue is often filled to the limit,
with short jobs sometimes coming in between. 

What I am looking for is something like a "best practice" for handling such mixed workloads in real life when checkpointing is not possible. 
Is job suspension heavily used? Is it a sensible approach in general? 

Trying to free myself completely from the boundaries dictated by classical queues as they exist in PBS / Slurm, 
what I could come up with is the following (complex) ruleset: 

1) Any job can fill up all resources (apart from the "interactive job" specialty mentioned previously). 

2) Users should specify an "expected wallclock time" for their jobs to allow ranking / more educated handling of preemption. 
   It seems HTCondor does not offer a built-in attribute for that, though - is this generally solved via a custom job ad attribute? 
   Is there a commonly used name, e.g. JobRuntime? 

3) If a job with "JobRuntime > 48 h" has been running for more than 10 % of its expected "JobRuntime", never "kill" it
   (but potentially suspend it, see below). 

4) Never kill jobs with (JobRuntime >= 1 week). 

5) If short jobs (< 12 h expected wall clock time) come in, first kill jobs which may still be killed according to (3) and (4),
   i.e. jobs with JobRuntime <= 48 h, or longer (but not week-long) jobs whose actual wall clock time is still < 10 % of their JobRuntime. 
   Then, in a second step, if more resources are needed, suspend other running jobs, ranked by their accumulated suspension time,
   i.e. jobs already suspended for a long time (in total) will be left untouched. 

6) Never suspend jobs which have been suspended for more than 25 % of their total expected wall clock time. 

7) Kill jobs which have run longer than 1.5*JobRuntime. 
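For (2), my naive idea would be a custom attribute set in the submit description file. The attribute name "JobRuntime" and the script name are my own invention - this is just a sketch of what I have in mind, with the expected wall clock time given in seconds:

```
# Hypothetical submit file snippet: the user announces the expected
# wall clock time (here: 2 days = 172800 s) as a custom job ad
# attribute. The "+Attr = value" syntax adds an attribute to the job ad.
executable = run_analysis.sh
+JobRuntime = 172800
queue
```

Policy expressions on the execute nodes could then refer to it as TARGET.JobRuntime (if I understand the evaluation scoping correctly).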

Does this sound reasonable? 
Does somebody have ideas / experiences implementing something like this in HTCondor? 
I'm a bit at a loss as to what the RANK expression should look like. 

My first try would be:
WANT_HOLD = TotalJobRunTime > 1.5*JobRuntime
WANT_HOLD_REASON = "Job exceeded announced JobRuntime by more than 50 %."
TARGET_IS_SHORT = (TARGET.JobRuntime < 12 * (60*60) )
MY_JOB_IS_LONG = (JobRuntime > 48 * (60 * 60))
MY_JOB_IS_VERY_LONG = (JobRuntime > 7 * 24 * (60 * 60))
RANK = ifThenElse($(TARGET_IS_SHORT), 100, 0) - ifThenElse($(MY_JOB_IS_VERY_LONG) || ($(MY_JOB_IS_LONG) && (TotalJobRunTime > (0.1*JobRuntime))), 100, 0)
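For (A) and (D), my current guess (untested, and the JobRuntime attribute is my own naming) would be something along these lines - is exporting the running job's attribute into the slot ad via STARTD_JOB_ATTRS, and steering towards suspension via WANT_SUSPEND, the intended mechanism?

```
# Guess: copy the running job's JobRuntime attribute into the slot ad,
# so that unqualified references to JobRuntime in startd policy
# expressions resolve against the currently running job:
STARTD_JOB_ATTRS = $(STARTD_JOB_ATTRS) JobRuntime

# Guess: prefer suspension over eviction for jobs protected by
# rules (3) and (4); everything else is fair game for killing:
WANT_SUSPEND = $(MY_JOB_IS_VERY_LONG) || \
               ($(MY_JOB_IS_LONG) && (TotalJobRunTime > (0.1*JobRuntime)))
```

But I have no idea whether this interacts with RANK-based preemption the way I hope it does.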

But I'm already confused: 
A) How do I access the "JobRuntime" of the running job and of the job that wants to start? 
   Can I really access one as TARGET.JobRuntime and the other as just JobRuntime in the STARTD configuration? 
   Or is it MY.JobRuntime ? 
B) Is "TotalJobRunTime" the correct variable to use if jobs can be suspended or even killed? 
C) How do I allow for the mixed behaviour, i.e. some jobs should be considered for killing
   (namely those with $(MY_JOB_IS_LONG) (but not very long) and (TotalJobRunTime < (0.1*JobRuntime))),
   and all others for suspension only? 
D) How do I make "SUSPEND" the action that is triggered when a job with a higher rank arrives? 
E) How do I factor in the ranking mentioned in (5) ? 

If (C) is not possible: 
Can a slot have more than one suspended job assigned to it? I.e. how much swap do I need to allocate per machine - is 1.25 * MachineMemory sufficient? 

Sorry for the long mail, but I found the documentation not too clear, and maybe there are completely different proposals which attack the issues in a much cleaner way - 
so I'm really looking for best practices from real life ;-). 
How are others handling such mixed kinds of jobs? 

Everything would be easier if there were simple OCI support (such as for runc and future Singularity versions) so the container runtime could handle checkpointing, 
potentially even in userspace (a.k.a. "rootless containers" / user-namespace containers). Sadly, HTCondor is not there yet,
and instead offers a multitude of differing, separate container integrations with varying feature sets and of varying quality.