I've been enjoying HTCondor Week 2022 so far and wanted to continue the conversation started by Peter Couvares' "Future of Computation" talk, specifically about "Immediate" or "Online" jobs that need to run right away. Since I'm attending virtually and can't do so at the Terrace social tonight, I'm turning to the users group.
To summarize the need/ask: There are occasionally jobs whose output is only useful if it arrives right away, and HTCondor doesn't currently have a great way of handling this. This is in contrast to typical batch jobs, where the time spent in the queue is usually irrelevant (up to a point). I understand that the desire for immediate turn-around time is at odds with the high-throughput nature of HTCondor, but the scheduling system needs to be aware of these exceptions if the resources/EPs are to be shared with normal batch jobs.
The solution presented in the talk was to have dedicated slots set up for these jobs. The downside is either low utilization of those resources when there are no jobs of this type or, if the cores are oversubscribed, an impact on other jobs currently running on the machine. The second downside didn't seem to matter much in their case, so oversubscription is what they went with.
It was also mentioned during the post-talk discussion that the condor_now tool might help with this problem. I haven't used this tool myself, but from the documentation it looks like it replaces one currently running job from a scheduler with another idle job from the same scheduler, essentially just reassigning the existing claim. This isn't ideal for our particular use-case because:
a) It requires the scheduler to have an existing claim to reuse, which may not be the case.
b) It requires the new job to have already been submitted and to exist on the schedd. This adds some overhead time, but more importantly it forces the submitting program to wait until the job is created before it can continue. (In testing for DAG jobs it took 30-40 seconds from submit time until condor_dagman was running and had submitted the first job in the graph, and then an additional 30-60 seconds for that job to match and start running if resources were available.)
Our current solution to this uses the now-removed Compute On Demand (COD) functionality of HTCondor. I don't recommend doing this because it won't work in the current version (9.0+), but I think it demonstrates what we're trying to do. We're also "submitting" from our own Python API for creating submissions, which then sends the info to a service stack that eventually ends up doing the actual condor_submit_dag using the Python bindings. The specifics of how this part works aren't important, but if you want more details I did a presentation on it at HTCondor Week 2019 and the slides should be out there. The relevant part is that the user can pass an argument at submit time in the Python API which tells the submissions service to use COD instead of condor_submit_dag. When this happens the submissions service does roughly these steps:
1. Build a jobAd from the submission graph
2. Get a list of startdAds for eligible slots from the Collector
3. Iterate through the startdAds to find the best match
4. Create a claim on the matching slot and activate the jobAd there
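The matching part of these steps (the second and third) could look roughly like the sketch below. It is only an illustration under loud assumptions: plain dicts stand in for ClassAds, Python callables stand in for the job's Requirements and Rank expressions, and the slot names and attribute values are made up. In a real pool the slot ads would come from the Collector (for example via htcondor.Collector().query(...)) rather than a hard-coded list.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Plain dicts stand in for ClassAds in this sketch.
Ad = Dict[str, object]


@dataclass
class Job:
    ad: Ad
    requirements: Callable[[Ad], bool]  # stand-in for the job's Requirements
    rank: Callable[[Ad], float]         # stand-in for the job's Rank


def find_best_slot(
    job: Job,
    slot_ads: List[Ad],
    slot_accepts: Callable[[Ad, Ad], bool] = lambda slot, job_ad: True,
) -> Optional[Ad]:
    """Return the eligible slot ad with the highest rank, or None."""
    # Keep only slots that satisfy the job and that would accept the job.
    eligible = [
        slot for slot in slot_ads
        if job.requirements(slot) and slot_accepts(slot, job.ad)
    ]
    if not eligible:
        return None
    # "Sorting matches" reduces to picking the slot the job ranks highest.
    return max(eligible, key=job.rank)


# Hypothetical slot ads; real ones carry many more attributes.
slots = [
    {"Name": "slot1@node01", "Cpus": 8, "Memory": 32768, "State": "Unclaimed"},
    {"Name": "slot1@node02", "Cpus": 16, "Memory": 65536, "State": "Claimed"},
    {"Name": "slot1@node03", "Cpus": 4, "Memory": 8192, "State": "Unclaimed"},
]

job = Job(
    ad={"Owner": "alice", "RequestCpus": 8, "RequestMemory": 16384},
    requirements=lambda s: s["Cpus"] >= 8 and s["Memory"] >= 16384,
    # Prefer more memory, but heavily penalize slots that are already busy.
    rank=lambda s: s["Memory"] - (100000 if s["State"] == "Claimed" else 0),
)

best = find_best_slot(job, slots)
print(best["Name"] if best else "no eligible slot")
```

The slot_accepts hook is where a slot-side policy check (the equivalent of the startd's START expression) would go; it defaults to accepting everything to keep the sketch short.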
This glosses over some important details, like user permissions and tracking the claims, but for this request I want to focus on step 4. The core of this feature request is to provide a way of doing that step. Given a jobAd and a matching startdAd, run the job on the startd. This bypasses the scheduling system to start the job as soon as possible while still informing the Negotiator that the slot can't accept other work. The whole process typically takes less than a second from user submit to the job actually running. That said, there are many downsides to this approach that limit its use to when the start time is absolutely critical:
- Since you're skipping the Negotiator, you lose many of the reasons you're probably using HTCondor to begin with:
  - No fair-share or group accounting
  - You have to reimplement the sorting of matches (luckily easy with the included Python bindings)
- You're also skipping the Schedd, which means:
  - No way of querying job info out-of-the-box
  - No re-queuing of the process; once it exits it's gone
  - No handling of file transfer
- You can only claim entire slots, even if they're partitionable
So we use this sparingly, but it is important for the times it's needed.
My ideal solution would be something like adding a startJob(jobAd) function to the htcondor.Startd class. This would probably require the Startd object to have been initialized with a StartdPrivate ad, and the calling user/host to have permissions in the matching ALLOW config entries. COD suspends currently running jobs on the slot when it's claimed, but I wouldn't mind if they were just vacated to make room.
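As a rough illustration of the call shape being asked for, here is a mock-up. Nothing in it exists in the Python bindings today: ProposedStartd is a hypothetical stand-in class, the dicts stand in for ClassAds, the ad contents are placeholders, and the behavior (refusing without a private ad, vacating whatever is running) simply mirrors the description above.

```python
class ProposedStartd:
    """Hypothetical stand-in for htcondor.Startd with the requested method."""

    def __init__(self, slot_ad, private_ad=None):
        self.slot_ad = slot_ad
        self.private_ad = private_ad  # stand-in for the StartdPrivate ad
        self.running = []             # job ads currently on the slot

    def startJob(self, job_ad):
        """Start job_ad on the slot right away, vacating current jobs."""
        if self.private_ad is None:
            # Assumption: the private ad is what authorizes claiming the slot.
            raise PermissionError("a StartdPrivate ad is required to start a job")
        # Vacate (rather than suspend, as COD did) whatever is running.
        vacated, self.running = self.running, [job_ad]
        return vacated


# Placeholder ads for illustration only.
startd = ProposedStartd(
    {"Name": "slot1@node01"},
    private_ad={"Capability": "<secret>"},
)
startd.startJob({"Cmd": "/bin/sleep", "Args": "60"})
displaced = startd.startJob({"Cmd": "/usr/bin/urgent_task"})
print(displaced)
```

A real implementation would also have to check the caller against the ALLOW config entries on the startd side, which this mock-up does not attempt.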
With all of that out of the way, is anyone else out there running into a similar requirement? What was your solution? If you decided to use an entirely different system that doesn't interface with HTCondor how did that impact your batch pool? What would your dream solution be? Is there something else important I'm completely missing?
I'd love to hear from people.