Re: [HTCondor-users] parallel jobs and partitionable slots


Dear Jason,

I have the same issue with HTCondor when using dynamic slots and parallel jobs. Everything goes fine at first: after the jobs are submitted, HTCondor creates dynamic slots, claims them, and runs jobs one by one. But after some time (1-5 minutes) it gets stuck: there are jobs in the queue and claimed slots, but nothing happens.

The first strange thing I noticed: I set CLAIM_WORKLIFE = 0 and expected HTCondor to release the slot after each job run, but sometimes it kept the slots claimed and released them only after UNUSED_CLAIM_TIMEOUT. Is this a bug?

The second strange thing: I set UNUSED_CLAIM_TIMEOUT = 30, and I see that HTCondor creates dynamic slots, releases them after UNUSED_CLAIM_TIMEOUT, creates them again, releases them again, and so on, many times without running anything. It can take 30 minutes just to run one job. That is, the current state is: the HTCondor cluster is free, there is a big list of jobs in the queue, and HTCondor runs them one by one at intervals of about 30 minutes (not often). Or is this by design for dynamic slots?

Best regards,
Dmitry.

From: "Jason Patton" <jpatton@xxxxxxxxxxx>
To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Sent: Friday, June 11, 2021 9:21:18 PM
Subject: Re: [HTCondor-users] parallel jobs and partitionable slots - jobs run very slowly

Stanislav,

You are running across one of the difficulties with using the dedicated scheduler and parallel universe jobs in that, because all jobs in a parallel universe job cluster must run at the same time, resources will sit claimed and idle until all of the needed resources are claimed for that job, and then also for some time after the job has finished. The default is for the dedicated scheduler to hold on to idle claimed resources for 10 minutes, as you have noted. This can be annoying when using partitionable slots and jobs of varying resource requests because a "big" dynamic slot that was used for a parallel universe job might stick around idle for those 10 minutes.

I'm not sure if this helps you accomplish exactly what you're asking for, but you can reduce the time a dedicated scheduler is allowed to keep an idle slot claimed by reducing UNUSED_CLAIM_TIMEOUT in the condor config of the machine hosting the dedicated scheduler. Here's the description from our condor_config.local.dedicated.submit example:

## If the dedicated scheduler has resources claimed, but nothing to
## use them for (no MPI jobs in the queue that could use them), how
## long should it hold onto them before releasing them back to the
## regular Condor pool? Specified in seconds. Default is 10 minutes.
## If you define this to '0', the schedd will never release claims
## (unless the schedd is shutdown). If your dedicated resources are
## configured to only run jobs, you should probably set this attribute
## to '0'
#UNUSED_CLAIM_TIMEOUT = 600
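
For example, a minimal sketch of lowering the timeout to one minute on the machine hosting the dedicated scheduler (the value 60 is only an illustrative assumption, not a tested recommendation for any particular pool):

## Release idle claims held by the dedicated scheduler after 60 seconds
## instead of the default 600. The value 60 is a hypothetical example.
UNUSED_CLAIM_TIMEOUT = 60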

Jason Patton

On Fri, Jun 11, 2021 at 8:36 AM Stanislav V. Markevich via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:

Hi!

I have a problem with partitionable slots and parallel jobs.
Jobs are so large that two jobs cannot run simultaneously. Jobs have different requirements so dynamic slots created for one job are rarely suitable for another one.

To speed up the creation of dynamic slots I set the following parameters:
CLAIM_PARTITIONABLE_LEFTOVERS = FALSE
CONSUMPTION_POLICY = TRUE

When the number of running and queued jobs is low, everything is fine: HTCondor creates dynamic slots, runs jobs, and deletes the dynamic slots after they have been used.

But as the number of jobs in the queue grows, jobs start more and more slowly. There are long intervals between one job finishing and the next one starting (10-30 minutes and more).
At some point HTCondor may stop running jobs completely (for hours). I see that dynamic slots are being created and claimed by different jobs and then released after 10 minutes of inactivity.
No job can get the number of slots it needs to run.

Is there any solution for this?
Is it possible to tell HTCondor not to try to match multiple jobs at a time, but instead to match all the slots for the first job, run it, and only then process the next job?

Thanks.

Best regards,
Stanislav V. Markevich
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
