
Re: [HTCondor-users] parallel jobs and partitionable slots - jobs run very slowly



Jason, 

thank you, reducing UNUSED_CLAIM_TIMEOUT indeed speeds things up.

My idea is to release all idle claims before the next negotiation cycle occurs. During that cycle the first job will then be able to claim all available resources and run, if they are sufficient.
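
For reference, this is roughly what I have in mind on the machine hosting the dedicated scheduler (the value of 30 is only an illustration; the point is that it is below the 60-second default negotiation interval):

## Illustrative value, not a documented default: release idle
## dedicated-scheduler claims before the next negotiation cycle
## (NEGOTIATOR_INTERVAL defaults to 60 seconds), so the first waiting
## job can be matched against all of the freed resources at once.
UNUSED_CLAIM_TIMEOUT = 30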

Unfortunately, I now run into the issue again where all necessary resources are available and the required dynamic slots are created and claimed by the job, but the job cannot start.

It looks like a bug in HTCondor, because if I lower the job requirements I can see that it creates one extra slot and then my job runs.


So if I submit a parallel job with two nodes, one requiring cpu=25, memory=256 and the other cpu=100, memory=960, the job cannot run and the slot configuration looks like this (the submit file is sketched below, after the listings):

Name                                         OpSys State     Activity Cpus Memory
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx   LINUX Unclaimed Idle     0    840
slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX Claimed   Idle     100  960
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx   LINUX Unclaimed Idle     75   1544
slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX Claimed   Idle     25   256

Now, if I lower the requirements for the second node to cpu=75, memory=960, then at first the slots look like this:

Name                                         OpSys State     Activity Cpus Memory
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx   LINUX Unclaimed Idle     100  1800
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx   LINUX Unclaimed Idle     0    584
slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX Claimed   Idle     25   256
slot1_2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX Claimed   Idle     75   960

and the job doesn't start, but soon afterwards (after the next negotiation cycle?) there is one extra slot with the same resources, cpu=75, memory=960:

Name                                         OpSys State     Activity Cpus Memory
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx   LINUX Unclaimed Idle     25   840
slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX Claimed   Idle     75   960
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx   LINUX Unclaimed Idle     0    584
slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX Claimed   Idle     25   256
slot1_2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX Claimed   Idle     75   960

and now the job runs.
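
For reference, the submit file for this two-node job looks roughly like the following (the executable name is just a placeholder; as far as I know, one queue statement per node class is the way to request different resources for the nodes of a parallel job):

# Sketch only - the executable name is a placeholder.
# Each queue statement adds one node of the same parallel job,
# with its own resource request.
universe       = parallel
executable     = my_parallel_task
machine_count  = 1
request_cpus   = 25
request_memory = 256
queue

request_cpus   = 100
request_memory = 960
queue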



Best regards,
Stanislav Markevich



----- Original Message -----
From: "Jason Patton" <jpatton@xxxxxxxxxxx>
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Sent: Friday, 11 June, 2021 21:21:18
Subject: Re: [HTCondor-users] parallel jobs and partitionable slots - jobs run very slowly

Stanislav, 
You are running across one of the difficulties of using the dedicated scheduler with parallel universe jobs: because all jobs in a parallel universe job cluster must run at the same time, resources will sit claimed and idle until all of the needed resources are claimed for that job, and then also for some time after the job has finished. The default is for the dedicated scheduler to hold on to idle claimed resources for 10 minutes, as you have noted. This can be annoying when using partitionable slots and jobs with varying resource requests, because a "big" dynamic slot that was used for a parallel universe job might stick around idle for those 10 minutes.

I'm not sure if this helps you accomplish exactly what you're asking for, but you can reduce the time a dedicated scheduler is allowed to keep an idle slot claimed by reducing UNUSED_CLAIM_TIMEOUT in the condor config of the machine hosting the dedicated scheduler. Here's the description from our condor_config.local.dedicated.submit example: 

## If the dedicated scheduler has resources claimed, but nothing to 
## use them for (no MPI jobs in the queue that could use them), how 
## long should it hold onto them before releasing them back to the 
## regular Condor pool? Specified in seconds. Default is 10 minutes. 
## If you define this to '0', the schedd will never release claims 
## (unless the schedd is shutdown). If your dedicated resources are 
## configured to only run jobs, you should probably set this attribute 
## to '0' 
#UNUSED_CLAIM_TIMEOUT = 600 
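
For example (the value is only illustrative), putting

UNUSED_CLAIM_TIMEOUT = 60

in the config of the machine hosting the dedicated scheduler and running condor_reconfig there should make it give idle claims back after one minute instead of ten.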

Jason Patton 

On Fri, Jun 11, 2021 at 8:36 AM Stanislav V. Markevich via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:


Hi! 

I have a problem with partitionable slots and parallel jobs. 
The jobs are so large that two of them cannot run simultaneously. They have different requirements, so dynamic slots created for one job are rarely suitable for another.

To speed up the creation of dynamic slots I set the following parameters:
CLAIM_PARTITIONABLE_LEFTOVERS = FALSE 
CONSUMPTION_POLICY = TRUE 
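
(As I understand these settings: CLAIM_PARTITIONABLE_LEFTOVERS = FALSE keeps the schedd from claiming the leftover partitionable slot resources for itself after a match, and CONSUMPTION_POLICY = TRUE lets the negotiator carve several dynamic slots out of one partitionable slot within a single negotiation cycle.)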

When the number of running and queued jobs is low, everything is fine: HTCondor creates dynamic slots, runs jobs, and deletes the dynamic slots after they have been used.

But as the number of jobs in the queue grows, things become slower and slower. There are long intervals between one job finishing and the next one starting (10-30 minutes or more).
At some point HTCondor may stop running jobs completely (for hours). I can see that dynamic slots are being created and claimed by different jobs and then released after 10 minutes of inactivity.
No job can get the required number of slots to run.

Is there any solution for this? 
Is it possible to tell HTCondor not to try to match multiple jobs at a time, but instead to first match all slots for the first job, run it, and only then process the next job?

Thanks. 


Best regards, 
Stanislav V. Markevich 




_______________________________________________ 
HTCondor-users mailing list 
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a 
subject: Unsubscribe 
You can also unsubscribe by visiting 
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users 

The archives can be found at: 
https://lists.cs.wisc.edu/archive/htcondor-users/