
Re: [HTCondor-users] parallel jobs and partitionable slots - jobs run very slowly



Stanislav,

You are running into one of the difficulties of using the dedicated scheduler with parallel universe jobs: because all nodes of a parallel universe job cluster must run at the same time, resources sit claimed but idle until every resource the job needs has been claimed, and then again for some time after the job has finished. By default, the dedicated scheduler holds on to idle claimed resources for 10 minutes, as you have noted. This can be annoying with partitionable slots and jobs of varying resource requests, because a "big" dynamic slot that was used for a parallel universe job might stick around idle for those 10 minutes.

I'm not sure if this accomplishes exactly what you're asking for, but you can reduce the time the dedicated scheduler is allowed to keep an idle slot claimed by lowering UNUSED_CLAIM_TIMEOUT in the condor config of the machine hosting the dedicated scheduler. Here's the description from our condor_config.local.dedicated.submit example:

## If the dedicated scheduler has resources claimed, but nothing to
## use them for (no MPI jobs in the queue that could use them), how
## long should it hold onto them before releasing them back to the
## regular Condor pool?  Specified in seconds.  Default is 10 minutes.
## If you define this to '0', the schedd will never release claims
## (unless the schedd is shutdown).  If your dedicated resources are
## configured to only run jobs, you should probably set this attribute
## to '0'
#UNUSED_CLAIM_TIMEOUT = 600
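
For example, to have the dedicated scheduler release idle claims after one minute instead of ten, you could add something like the following to the config on the submit machine hosting the dedicated scheduler (the 60-second value is just an illustration; pick whatever timeout suits your workload):

```
# Release claimed-but-idle dedicated resources after 60 seconds
# instead of the default 600 (10 minutes).
UNUSED_CLAIM_TIMEOUT = 60
```

and then run condor_reconfig on that machine so the schedd picks up the change.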

Jason Patton

On Fri, Jun 11, 2021 at 8:36 AM Stanislav V. Markevich via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
Hi!

I have a problem with partitionable slots and parallel jobs.
The jobs are so large that two jobs cannot run simultaneously. The jobs have different requirements, so dynamic slots created for one job are rarely suitable for another.

To speed up the creation of dynamic slots I set the following parameters:
CLAIM_PARTITIONABLE_LEFTOVERS = FALSE
CONSUMPTION_POLICY = TRUE

When the number of running and queued jobs is low, everything is fine: HTCondor creates dynamic slots, runs jobs, and deletes the dynamic slots after they have been used.

But as the number of jobs in the queue grows, execution becomes slower and slower. There are big intervals between one job finishing and the next job starting (10-30 minutes or more).
At some point HTCondor may stop running jobs completely (for hours). I see dynamic slots being created and claimed by different jobs, then released after 10 minutes of inactivity.
No job can get the required number of slots to run.

Is there any solution for this?
Is it possible to tell HTCondor not to match multiple jobs at a time, but instead to first match all the slots for the first job, run it, and only then process the next job?

Thanks.


Best regards,
Stanislav V. Markevich
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/