
Re: [HTCondor-users] How to distribute jobs round robin



You could try using 

 NEGOTIATOR_PRE_JOB_RANK

to sort the machines by memory so that higher ranked machines are matched first.
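
A minimal sketch of what that could look like, assuming (as with the +MY.Cpus setting quoted further down in this thread) that MY refers to the machine ad, and that the Memory attribute of a partitionable slot reflects the memory still unclaimed on it. The expression is illustrative only, not a tested policy:

```
# Central manager configuration (sketch, untested assumption).
# Prefer the machine whose partitionable slot has the most free memory.
NEGOTIATOR_PRE_JOB_RANK = MY.Memory
```

Since the pre-job rank is considered before NEGOTIATOR_POST_JOB_RANK, a Cpus-based post rank would then only break ties. Run condor_reconfig on the central manager afterwards.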

-tj

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Guilherme De Sousa <garanha@xxxxxxxxxxxx>
Sent: Thursday, March 25, 2021 1:49 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] How to distribute jobs round robin
 

Never mind!

 

My current pool is very heterogeneous, and the hosts being chosen are the ones with far more CPUs, which is why it keeps looping around these two!

 

Any suggestions on how I could also take into consideration memory availability since I’m using partitionable slots?

 

Best regards,

Guilherme de Sousa Aranha

BANCO DE PORTUGAL
Departamento de Sistemas e Tecnologias de Informação / Systems and Information Technology Department
DSITI/ESA - Engenharia de Sistemas Aplicacionais


Rua Francisco Ribeiro, 2 | 1150-165 Lisboa
Ext. 20792
garanha@xxxxxxxxxxxx www.bportugal.pt

 

From: Guilherme De Sousa
Sent: 25 March 2021 18:46
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: RE: How to distribute jobs round robin

 

OK, so after searching a bit more and changing the terminology from round robin to breadth-first (probably more accurate), I found this:

 

https://www-auth.cs.wisc.edu/lists/htcondor-users/2016-November/msg00032.shtml

 

which suggests:

NEGOTIATOR_PRE_JOB_RANK = 0

NEGOTIATOR_POST_JOB_RANK = +MY.Cpus

 

After applying this on my central manager and running condor_reconfig, jobs now start on new hosts, although they tend to loop between only 3 of them instead of starting on all 9.

 

Can someone tell me if this is an acceptable approach? :)

 

Best regards,

 

Guilherme de Sousa Aranha

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Guilherme De Sousa
Sent: 24 March 2021 19:06
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] How to distribute jobs round robin

 

 

 

I had a typo when I copy-pasted: CLAIM_PARTITIONABLE_LEFTOVERSE (extra *E* at the end) instead of CLAIM_PARTITIONABLE_LEFTOVERS.

I fixed it and also ran condor_reconfig, but the jobs keep starting on wrk03.

 

I’m pretty sure they all match the jobs; here is an example from condor_q -better-analyze:

 

1107.000:  Job is running.

Last successful match: Wed Mar 24 19:04:01 2021

1107.000:  Run analysis summary ignoring user priority.  Of 9 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
      9 are able to run your job

 

I’ve also started a few big jobs to fill up wrk03, and the last job started on a new host.

 

Best regards,

Guilherme de Sousa Aranha


 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of John M Knoeller
Sent: 24 March 2021 18:11
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] How to distribute jobs round robin

 



 


Did you condor_reconfig after making the change?  

 

Are you sure that all of the machines can match the jobs?

 

-tj

 


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Guilherme De Sousa <garanha@xxxxxxxxxxxx>
Sent: Wednesday, March 24, 2021 12:32 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] How to distribute jobs round robin

 

Didn’t work either…

It still only starts jobs on wrk03, as you can see from condor_status:

 

[root@srv-sub01 ~]# condor_status
Name                             OpSys      Arch   State     Activity LoadAv     Mem   ActvtyTime

slot1@xxxxxxxxxxxxxxxxxxxxxxxx   LINUX      X86_64 Unclaimed Idle      0.000 1031967 112+02:41:53
slot1@xxxxxxxxxxxxxxxxxxxxxxxx   LINUX      X86_64 Unclaimed Idle      0.000  193385 114+04:00:43
slot1@xxxxxxxxxxxxxxxxxxxxxxxx   LINUX      X86_64 Unclaimed Idle      0.000  131945 114+04:05:09
slot1_1@xxxxxxxxxxxxxxxxxxxxxxxx LINUX      X86_64 Claimed   Busy      0.000   32768   2+06:59:04
slot1_2@xxxxxxxxxxxxxxxxxxxxxxxx LINUX      X86_64 Claimed   Busy      0.000    4096   2+05:48:02
slot1_3@xxxxxxxxxxxxxxxxxxxxxxxx LINUX      X86_64 Claimed   Busy      0.000    4096   1+07:08:47
slot1_4@xxxxxxxxxxxxxxxxxxxxxxxx LINUX      X86_64 Claimed   Busy      0.000    4096   1+02:30:42
slot1_5@xxxxxxxxxxxxxxxxxxxxxxxx LINUX      X86_64 Claimed   Busy      0.000    4096   0+06:38:09
slot1_6@xxxxxxxxxxxxxxxxxxxxxxxx LINUX      X86_64 Claimed   Busy      0.000    4096   0+01:13:57
slot1_7@xxxxxxxxxxxxxxxxxxxxxxxx LINUX      X86_64 Claimed   Busy      0.000    4096   0+00:00:03
slot1_8@xxxxxxxxxxxxxxxxxxxxxxxx LINUX      X86_64 Claimed   Busy      0.000    4096   0+00:00:03
slot1@xxxxxxxxxxxxxxxxxxxxxxxx   LINUX      X86_64 Unclaimed Idle      0.000  515937 117+02:51:10
slot1@xxxxxxxxxxxxxxxxxxxxxxxx   LINUX      X86_64 Unclaimed Idle      0.000  515953 117+02:34:03
slot1@xxxxxxxxxxxxxxxxxxxxxxxx   LINUX      X86_64 Unclaimed Idle      0.000  515953 112+02:43:40
slot1@xxxxxxxxxxxxxxxxxxxxxxxx   LINUX      X86_64 Unclaimed Idle      0.000  515953 117+03:00:16
slot1@xxxxxxxxxxxxxxxxxxxxxxxx   LINUX      X86_64 Unclaimed Idle      0.000  128617 112+02:42:23
slot1@xxxxxxxxxxxxxxxxxxxxxxxx   LINUX      X86_64 Unclaimed Idle      0.000 1031959 112+02:47:07

               Machines Owner Claimed Unclaimed Matched Preempting  Drain

  X86_64/LINUX       17     0       8         9       0          0      0

         Total       17     0       8         9       0          0      0
[root@srv-sub01 ~]#

 

 

Best regards,

Guilherme de Sousa Aranha

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of John M Knoeller
Sent: 24 March 2021 17:03
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] How to distribute jobs round robin

 

 

When using partitionable slots, the Schedd can start more than one job on a single partitionable slot for each match that it gets from the negotiator. This leads to something that appears to be depth-first matching.

 

If you configure

  CLAIM_PARTITIONABLE_LEFTOVERS = false

in the Schedd, it will start only one job for each match it gets from the negotiator, and your negotiator matching policy will then have more traction.

 

The downside is that it will take many more negotiation cycles for a Schedd to fill up a partitionable slot. If your machines are going to end up completely full anyway, this is wasted effort.

 

-tj

 


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Guilherme De Sousa <garanha@xxxxxxxxxxxx>
Sent: Wednesday, March 24, 2021 8:43 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] How to distribute jobs round robin

 

Hi Michael,

Thanks for the quick reply!

Unfortunately it didn't work; the jobs are still being scheduled to a single machine until it is full.
I also checked the docs for NEGOTIATOR_DEPTH_FIRST: the default is false, but I set it explicitly anyway.

Best regards,

Guilherme de Sousa Aranha


-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Michael Pelletier via HTCondor-users
Sent: 24 March 2021 13:26
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Michael Pelletier <michael.v.pelletier@xxxxxxxxxxxx>
Subject: Re: [HTCondor-users] How to distribute jobs round robin


See NEGOTIATOR_DEPTH_FIRST, which was introduced in version 8.8.2.

If you set it to false, you should see the behavior you're looking for.
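
As a sketch, assuming the setting goes into the configuration of the central manager (where the negotiator runs):

```
# Sketch: do not pack each machine full before moving on to the next one.
NEGOTIATOR_DEPTH_FIRST = False
```

A condor_reconfig on the central manager would be needed for the change to take effect.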

Michael V Pelletier
Principal Engineer

Raytheon Technologies
Digital Technology
HPC Support Team


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/