
Re: [HTCondor-users] HTCondor-CE: Setting Default limits



Yes, that should work.  
I would recommend that you use EVALSET rather than SET in that transform, but it will work either way.
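For reference, an EVALSET variant of such a transform might look like the following. This is an untested sketch: it assumes the ClassAd `?:` defaulting operator and the list form of max(), and mirrors the SET transform quoted below.

```
JOB_ROUTER_TRANSFORM_NumCores @=jrt
  REQUIREMENTS MY.xcount > 1 || MY.RequestCpus > 1 || False
  # EVALSET evaluates the expression against the job ad at transform time,
  # rather than substituting macro values into the stored expression as SET does.
  EVALSET RequestCpus max({ MY.xcount ?: 1, MY.RequestCpus ?: 1 })
@jrt
```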

-tj

From: Stefano Dal Pra <stefano.dalpra@xxxxxxxxxxxx>
Sent: Friday, September 3, 2021 8:44 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>; John M Knoeller <johnkn@xxxxxxxxxxx>; Thomas Hartmann <thomas.hartmann@xxxxxxx>
Subject: Re: [HTCondor-users] HTCondor-CE: Setting Default limits
 
Thank you, John.

In the meanwhile I have set the following:

JOB_ROUTER_TRANSFORM_NumCores @=jrt
  REQUIREMENTS MY.xcount > 1 || MY.RequestCpus > 1 || False
  SET RequestCpus max({$(MY.xcount:1),$(MY.RequestCpus:1)})
@jrt

JOB_ROUTER_PRE_ROUTE_TRANSFORM_NAMES = $(JOB_ROUTER_PRE_ROUTE_TRANSFORM_NAMES) NumCores

This seems to behave as expected for all combinations of xcount and RequestCpus
in the submit file.

The LHC VOs currently set (or omit) xcount and RequestCpus
this way:

VO     xcount     RequestCpus
alice  8          8
alice  undefined  1
atlas  1          1
atlas  8          1
atlas  undefined  1
cms    8          1
cms    undefined  1
lhcb   undefined  1

That means taking the greatest defined value of xcount and RequestCpus
should always work.

Stefano


On 03/09/21 00:07, John M Knoeller wrote:
Hi Stefano.    

Currently, when using the new transform syntax, the incoming job's RequestCpus (and RequestMemory) are honored,
and default_xcount from the route has no effect unless the incoming job has no value for RequestCpus.

We have heard several reports from the field that while incoming jobs have values for RequestCpus (and RequestMemory) that were defaulted by condor_submit, those defaults are wrong, and CE administrators would prefer to have default_xcount override the incoming job values.  

So we have decided to change the CE configuration in the next update so that 
    RequestCpus from the incoming job will be ignored unless it is > 1
    RequestMemory from the incoming job will always be ignored

The ticket for this work is here

As a workaround for now: if you set default_xcount and also orig_RequestCpus in the route, you will get the behavior you expect now, and it will continue to work as expected after the above fix is released.
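Sketched in route form, that workaround might look like the following. This is only a guess at the concrete syntax: the route name and the value 4 are placeholders, and the exact semantics of orig_RequestCpus are as described above.

```
JOB_ROUTER_ROUTE_t1_defaults @=jrt
    UNIVERSE VANILLA
    # Per the workaround above: set both default_xcount and orig_RequestCpus
    # (placeholder value 4) so the route's default wins over the submit default.
    default_xcount = 4
    orig_RequestCpus = 4
@jrt
```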

At some point in the (probably distant) future, we would like to go back to assuming that incoming jobs have correct values for resource requests, since that is the "correct" thing to do for HTCondor.  But clearly, we were premature in making that change today. 

-tj


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Stefano Dal Pra <stefano.dalpra@xxxxxxxxxxxx>
Sent: Thursday, September 2, 2021 10:22 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>; Thomas Hartmann <thomas.hartmann@xxxxxxx>
Subject: Re: [HTCondor-users] HTCondor-CE: Setting Default limits
 
Hi Thomas,

thank you for your suggestions.

As for my Q2 from the previous email, my doubt comes from this statement in the manual:
https://htcondor.github.io/htcondor-ce/v5/configuration/writing-job-routes/#setting-a-default
<< To set a default number of cores for routed jobs, set the variable or attribute default_xcount >>...

So, having xcount undefined on the routed job could be OK; however, my routed job starts as single-core,
with CpusProvisioned = 1 and RequestCpus = 1, and that is unexpected.
My naive interpretation is that the "default_xcount = 4" in the JOB_ROUTER_ROUTE_<name>
entry should trigger a job transform in the CE schedd with the final effect of
setting RequestCpus = 4 on the job ad of the routed job, but this is not happening.

Stefano


On 01/09/21 10:54, Thomas Hartmann wrote:
Hi Stefano,

for Q1, maybe the quantize() function would be useful:

set_MyDefaultMemPerCore = 3000
set_MyMemScaling = xcount * MyDefaultMemPerCore
set_TmpScaledMem = quantize(RequestMemory,MyMemScaling)

but I am unsure whether it would handle high-memory jobs reasonably (it might, vice versa, be necessary to scale the core count up if the original memory-per-core request exceeds your defaults)
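For reference, with a numeric second argument, quantize() returns the smallest multiple of that quantum that is at least the first argument; a couple of illustrative evaluations:

```
# quantize(RequestMemory, quantum) with a numeric quantum returns the
# smallest multiple of quantum that is >= RequestMemory:
#   quantize(7000, 3000)  evaluates to  9000
#   quantize(3, 8)        evaluates to  8
```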

---

For Q2, my interpretation is that the xcount is reflected in
  OriginalCpus = 4
since the xcount attribute is, AFAIK, CE-internal and gets copied over to the RequestCpus and OriginalCpus attributes.

But maybe you can check whether your route actually got applied to your job?
E.g., we set a few defaults with [1] - note that the ad is added to JOB_ROUTER_DEFAULTS (the route has not been touched since CE4 and is in the []-syntax).

For specific rules like [2], it might be best for testing to always include a Requirements rule to distinguish which route a job takes, and to add the route to JOB_ROUTER_ROUTE_NAMES/JOB_ROUTER_ENTRIES.
I also prefer adding a 'tag' like "DESYROUTEPRIO" to routes so that I can more easily identify where a job went.

Cheers,
  Thomas


[1]
MERGE_JOB_ROUTER_DEFAULT_ADS=True
DESYDEFAULTS @=end
[
 set_DESYDEFAULTSSET =  True;

 set_default_xcount = 1;
 set_default_maxWallTime = 5760;
 set_default_maxMemory = 2048;

 set_requirements= ...
]
@end

JOB_ROUTER_DEFAULTS = $(JOB_ROUTER_DEFAULTS) $(DESYDEFAULTS)



[2]
DESYPRIO @=end
[
  TargetUniverse = 5;
  name = "DESYPRIO";
  set_DESYROUTEPRIO = True;

  Requirements = x509UserProxyVOName =?= "ops" ... ;

  # some more ads

]
@end

JOB_ROUTER_ENTRIES = $(JOB_ROUTER_ENTRIES) $(DESYPRIO)
JOB_ROUTER_ROUTE_NAMES = $(JOB_ROUTER_ROUTE_NAMES) DESYPRIO


On 31/08/2021 18.08, Stefano Dal Pra wrote:
Hello,

I'm working on configuring an HTCondor-CE 5.1 and have a few doubts about how to properly set default job limits.

I'm following the examples from here:
https://htcondor.github.io/htcondor-ce/v5/configuration/writing-job-routes/
such as this one:

JOB_ROUTER_ROUTE_Condor_Pool @=jrt
   UNIVERSE VANILLA
   # Set the requested memory to 1 GB
   default_maxMemory = 1000
@jrt

JOB_ROUTER_ROUTE_NAMES = Condor_Pool


Q1: Is it possible to set default_maxMemory to a value proportional to RequestCpus of the incoming job? i.e.
something like

default_maxMemory = $(RequestCpus:1) * 3000

Q2: I applied the following defaults:

JOB_ROUTER_ROUTE_t1_defaults  @=jrt
    UNIVERSE VANILLA
    default_xcount = 4
    default_maxMemory = 4321
    default_maxWallTime = 61
@jrt


  But I'm a bit confused by the overall results:

0) I submit a minimal test job:
[sdalpra@ui-htc htjobs]$ condor_submit -pool ce01t-htc.cr.cnaf.infn.it:9619 -remote ce01t-htc.cr.cnaf.infn.it ce_testp308.sub
Submitting job(s).
1 job(s) submitted to cluster 610.

1) The job is routed
[root@ce01t-htc ~]# condor_ce_q 610. -af routedtojobid
8428.0

2) I check classads from the routed job

[root@ce01t-htc ~]# condor_q 8428.0 -af:jln jobstatus CpusProvisioned xcount requestcpus OriginalCpus remote_NodeNumber remote_SMPGranularity BatchRuntime OriginalMemory remote_OriginalMemory OriginalCpus remote_NodeNumber remote_SMPGranularity
ID = 8428.0
  jobstatus = 2
  CpusProvisioned = 1
  xcount = undefined
  requestcpus = 1
  OriginalCpus = 4
  remote_NodeNumber = 4
  remote_SMPGranularity = 4
  BatchRuntime = 3660
  OriginalMemory = 4321
  remote_OriginalMemory = 4321
  OriginalCpus = 4
  remote_NodeNumber = 4
  remote_SMPGranularity = 4


So this is where I'm puzzled:
- I would expect to see  xcount = 4 but it is undefined instead.
- The running job reports CpusProvisioned = 1, and that makes me think that
remote_NodeNumber = 4, remote_SMPGranularity = 4, OriginalCpus = 4
are somehow ignored.
- BatchRuntime is there, with the proper value set as expected (61 * 60); however, I'm not sure about its meaning.
The HTCondor manual says: << For *batch* grid universe jobs, a limit in seconds on the job’s execution time, enforced by the remote batch system. >> Who is "remote" in this context? Does that mean the CE would stop the running routed job after 61 minutes? Moreover,
we have a vanilla universe job here, on both the CE and batch sides:

[root@ce01t-htc ~]# condor_ce_q 610. -l | grep -i univer
JobUniverse = 5

[root@ce01t-htc ~]# condor_q -l 8428.0 | grep -i univer
JobUniverse = 5
Remote_JobUniverse = 5

Thanks for any comment
Stefano




_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/




