Re: [HTCondor-users] Number of jobs that are started per negotiator cycle



Hi Todd,

Thank you very much for looking at this. We will simplify our configuration and test again. Running "condor_config_val -summary" shows a lot of customized settings that come from our production environment and make no sense in this small test.

Cheers,

Carles

On Fri, 3 Jun 2022 at 21:10, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
On 5/27/2022 8:08 AM, Carles Acosta wrote:
Dear all,

We have a test execute machine with 100 partitionable slots of 48 cores and 2 GB RAM per core. Everything is fake, just for testing purposes and running sleep commands. We are using HTCondor 9.0.12 in this test environment.

We are running different tests, each submitting a batch of 100 jobs to this execute machine and varying the number of requested CPUs. When we submit 100 jobs requesting 48 cores, all 100 jobs start in the first negotiator cycle. With 24 cores per job, 75 jobs start in the first cycle; with 8 cores, 31 jobs; and with 1 core, only 6 jobs.
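To make the numbers above concrete, here is a small arithmetic sketch of the four test cases. It uses only the figures given in this post (100 pslots, 48 cores each, 100 jobs per batch). The function name batch_summary is made up for illustration, and min_pslots is just the smallest number of pslots the batch could occupy if each were packed full, not a statement about how HTCondor actually places the jobs.

```shell
# Capacity check for the four test cases (values taken from the post).
SLOTS=100          # partitionable slots on the test execute machine
CORES_PER_SLOT=48  # cores in each partitionable slot
JOBS=100           # jobs submitted per batch

total_cores=$((SLOTS * CORES_PER_SLOT))

# Print the cores a batch requests in total, how many such jobs fit in one
# pslot, and the minimum number of pslots the batch could occupy.
batch_summary() {
    req=$1
    per_slot=$((CORES_PER_SLOT / req))
    min_pslots=$(( (JOBS + per_slot - 1) / per_slot ))
    echo "request_cpus=$req total=$((JOBS * req))/$total_cores jobs_per_pslot=$per_slot min_pslots=$min_pslots"
}

for req in 48 24 8 1; do
    batch_summary "$req"
done
```

Every batch fits within the 4800 available cores, so raw capacity alone does not account for only 75, 31, or 6 jobs starting in the first cycle.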

We have been looking in the manual for a negotiator, schedd, startd, or other configuration variable that explains this behavior, but had no luck. Is there any way to ensure, for instance, that all the jobs start in the first negotiator cycle even when requesting one CPU? Our guess is that some configuration settings and timeouts related to the creation of the dynamic slots are affecting this case, or maybe this is related to the auto clustering on the negotiator side?

Thank you in advance and have a nice weekend!

Carles

Hi Carles,

Something certainly seems amiss in your setup, or your test execute machine is having resource contention problems...

I tried to reproduce your environment by installing minicondor v9.0.13 into a CentOS 7 docker container (with 6 cores and 12 GB RAM), configuring the startd with 100 pslots of 48 cores each, and submitting 100 jobs each requesting 1 core. In my test, all 100 jobs got matched in the first negotiation cycle. Details of how I did my test are below (*).

In my test, almost all configuration knobs were just using the default settings. It may help to run "condor_config_val -summary" on your central manager and perhaps your access point (submit machine). This command will output all your customized config changes, i.e. settings that have been customized away from the default settings. This may give a clue. If you are willing to share the output of this command here (maybe sanitize hostnames if desired), we could also look it over for anything suspicious.
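One way to act on this suggestion is to save the summary from two setups and compare them. The following is only a sketch: the helper name changed_knobs and the file names are hypothetical, and it assumes the saved output consists of "NAME = value" lines plus "#" comment lines.

```shell
# Sketch: list settings that differ between two saved summaries.
# Assumes each file holds the output of "condor_config_val -summary",
# i.e. "NAME = value" lines plus "#" comment lines.
changed_knobs() {
    a=$(mktemp)
    b=$(mktemp)
    # Drop comment lines, keep assignments, and sort for comm(1).
    grep -v '^[[:space:]]*#' "$1" | grep '=' | LC_ALL=C sort > "$a"
    grep -v '^[[:space:]]*#' "$2" | grep '=' | LC_ALL=C sort > "$b"
    # comm -3 hides lines common to both files: column 1 is only in the
    # first file, column 2 (tab-indented) is only in the second.
    LC_ALL=C comm -3 "$a" "$b"
    rm -f "$a" "$b"
}

# Example use on machines with HTCondor installed (file names are hypothetical):
#   condor_config_val -summary > test-pool.txt      # on the test pool
#   condor_config_val -summary > minicondor.txt     # on a default minicondor
#   changed_knobs test-pool.txt minicondor.txt
```

Settings that appear in only one column are the ones worth a closer look first.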

Hope this helps,
Todd

(*) Brief testing procedure I used to try to reproduce the problem:

( fire up a minimal empty CentOS 7 container for testing; all subsequent commands are in the container )

$ docker run --rm -it centos:7

( next install minicondor from the stable (v9.0.x) channel in the container )

# curl -fsSL https://get.htcondor.org | /bin/bash -s -- --no-dry-run --channel stable

( configure startd with 100 48-core pslots, and set negotiator_interval to be huge so
 only one negotiation cycle will take place after submitting jobs )

# cat - > /etc/condor/config.d/05-test.conf
NUM_CPUS = 4800
MEMORY = 100 * 2048
SLOT_TYPE_1 = memory=2048, cpus=48
SLOT_TYPE_1_PARTITIONABLE = true
NUM_SLOTS_TYPE_1 = 100
NEGOTIATOR_INTERVAL = 5000
<CTRL-D>

(start up htcondor)

# condor_master

(become a non-root user to submit test sleep jobs)

# adduser tannenba

# su - tannenba

(submit 100 1-core sleep jobs)

$ condor_submit executable=/usr/bin/sleep arguments=120 request_cpus=1 -queue 100
Submitting job(s)....................................................................................................
100 job(s) submitted to cluster 3.

(after a few seconds and one negotiation cycle, all jobs are running... )

$ condor_q


-- Schedd: a5896b768045 : <127.0.0.1:9618?... @ 05/27/22 18:18:17
OWNER    BATCH_NAME    SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
tannenba ID: 3         5/27 18:18      _    100      _    100 3.0-99

Total for query: 100 jobs; 0 completed, 0 removed, 0 idle, 100 running, 0 held, 0 suspended
Total for tannenba: 100 jobs; 0 completed, 0 removed, 0 idle, 100 running, 0 held, 0 suspended
Total for all users: 100 jobs; 0 completed, 0 removed, 0 idle, 100 running, 0 held, 0 suspended
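To compare runs with different request_cpus values, it can help to pull just the running count out of the summary line right after the first cycle. A minimal sketch, where the helper name running_count is made up and the pattern is based only on the "Total for query:" line shown above:

```shell
# Sketch: extract the number of running jobs from condor_q's summary line.
# Reads condor_q output on stdin and prints the count from "Total for query:".
running_count() {
    sed -n 's/^Total for query:.* \([0-9][0-9]*\) running.*/\1/p'
}

# Example use on the access point (assumes "condor_q -totals" prints the same
# "Total for query:" line as the full output shown above):
#   condor_q -totals | running_count
```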




--
Carles Acosta i Silva
PIC (Port d'Informació Científica)
Campus UAB, Edifici D
E-08193 Bellaterra, Barcelona
Tel: +34 93 581 33 08
Fax: +34 93 581 41 10
http://www.pic.es
Avís - Aviso - Legal Notice: http://legal.ifae.es