
Re: [HTCondor-users] HTCondor-CE: Setting Default limits



Hi Thomas,

Thank you for your suggestions.

As for Q2 from my previous email, my doubt comes from this statement in the manual:
https://htcondor.github.io/htcondor-ce/v5/configuration/writing-job-routes/#setting-a-default
<< To set a default number of cores for routed jobs, set the variable or attribute default_xcount >>...

So, having xcount undefined on the routed job could be ok; however, my routed job starts as single-core,
with CpusProvisioned = 1 and requestcpus = 1,
and that is unexpected. My naive interpretation is that "default_xcount = 4" in the JOB_ROUTER_ROUTE_<name>
entry should trigger a "job transform" in the CE schedd, with the final effect of
setting requestcpus = 4 in the job ad of the routed job, but this is not happening.

Stefano


On 01/09/21 10:54, Thomas Hartmann wrote:
Hi Stefano,

For Q1, maybe the quantize() function might be useful:

set_MyDefaultMemPerCore = 3000
set_MyMemScaling = xcount * MyDefaultMemPerCore
set_TmpScaledMem = quantize(RequestMemory,MyMemScaling)

but I am unsure whether it would handle highmem jobs reasonably (conversely, it might be necessary to scale the core count up if the original memory-per-core request exceeds your defaults).
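To make the arithmetic concrete, here is a minimal Python sketch of what quantize() does with a scalar second argument (round the first argument up to the next multiple of the second). It is only a model of the ClassAd function, not HTCondor code; the helper names `scaled_memory` and `cores_for_memory` are hypothetical, and the 3000 MB per core figure is taken from the snippet above.

```python
import math

DEFAULT_MEM_PER_CORE = 3000  # MB per core, as in MyDefaultMemPerCore above


def quantize(value, step):
    """Model of ClassAd quantize() for a scalar step:
    round value up to the next multiple of step."""
    if step <= 0:
        raise ValueError("step must be positive")
    return math.ceil(value / step) * step


def scaled_memory(request_memory, xcount):
    """Hypothetical helper: memory rounded up to a multiple of
    xcount * DEFAULT_MEM_PER_CORE, mirroring TmpScaledMem above."""
    return quantize(request_memory, xcount * DEFAULT_MEM_PER_CORE)


def cores_for_memory(request_memory):
    """Hypothetical alternative for highmem jobs: the number of cores
    needed so that memory per core stays within the default."""
    return math.ceil(request_memory / DEFAULT_MEM_PER_CORE)


if __name__ == "__main__":
    print(scaled_memory(4321, 4))    # 4-core job asking for 4321 MB
    print(scaled_memory(20000, 4))   # highmem job on 4 cores
    print(cores_for_memory(20000))   # cores needed at 3000 MB per core
```

With these assumptions, a highmem job is rounded up to a multiple of the per-job allowance rather than rejected, which is why scaling the core count instead may be the better fit for such jobs.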

---

For Q2, my interpretation is that xcount is reflected in
 OriginalCpus = 4
since the xcount ad is AFAIK only something CE-internal and gets copied over to the RequestCpus & OriginalCpus ads.

But maybe you can check whether your route actually got applied to your job?
E.g., we set a few defaults with [1]. Note that the ad is added to JOB_ROUTER_DEFAULTS (the route has not been touched since CE4 and is in the []-syntax).

For specific rules like [2], it might be best for testing to always include a Requirements rule to distinguish which route a job takes, and to add the route to JOB_ROUTE_NAMES/JOB_ROUTE_ENTRIES.
I also prefer adding a 'tag' like "DESYROUTEPRIO" to routes so that I can more easily identify where a job went.

Cheers,
 Thomas


[1]
MERGE_JOB_ROUTER_DEFAULT_ADS=True
DESYDEFAULTS @=end
[
 set_DESYDEFAULTSSET = True;

 set_default_xcount = 1;
 set_default_maxWallTime = 5760;
 set_default_maxMemory = 2048;

 set_requirements= ...
]
@end

JOB_ROUTER_DEFAULTS = $(JOB_ROUTER_DEFAULTS) $(DESYDEFAULTS)



[2]
DESYPRIO @=end
[
 TargetUniverse = 5;
 name = "DESYPRIO";
 set_DESYROUTEPRIO = True;

 Requirements = x509UserProxyVOName =?= "ops" ... ;

 # some more ads

]
@end

JOB_ROUTER_ENTRIES = $(JOB_ROUTER_ENTRIES) $(DESYPRIO)
JOB_ROUTE_NAMES = $(JOB_ROUTE_NAMES) $(DESYPRIO)


On 31/08/2021 18.08, Stefano Dal Pra wrote:
Hello,

I'm working on configuring an HTCondor-CE 5.1 and have a few doubts about how to properly set default job limits.

I'm following the examples from here:
https://htcondor.github.io/htcondor-ce/v5/configuration/writing-job-routes/
such as this one:

JOB_ROUTER_ROUTE_Condor_Pool @=jrt
    UNIVERSE VANILLA
    # Set the requested memory to 1 GB
    default_maxMemory = 1000
@jrt

JOB_ROUTER_ROUTE_NAMES = Condor_Pool


Q1: Is it possible to set default_maxMemory to a value proportional to the RequestCpus of the incoming job, i.e.
something like

default_maxMemory = $(RequestCpus:1) * 3000

Q2: I applied the following defaults:

JOB_ROUTER_ROUTE_t1_defaults @=jrt
    UNIVERSE VANILLA
    default_xcount = 4
    default_maxMemory = 4321
    default_maxWallTime = 61
@jrt


But I'm a bit confused by the overall results:

0) I submit a minimal test job:
[sdalpra@ui-htc htjobs]$ condor_submit -pool ce01t-htc.cr.cnaf.infn.it:9619 -remote ce01t-htc.cr.cnaf.infn.it ce_testp308.sub
Submitting job(s).
1 job(s) submitted to cluster 610.

1) The job is routed
[root@ce01t-htc ~]# condor_ce_q 610. -af routedtojobid
8428.0

2) I check classads from the routed job

[root@ce01t-htc ~]# condor_q 8428.0 -af:jln jobstatus CpusProvisioned xcount requestcpus OriginalCpus remote_NodeNumber remote_SMPGranularity BatchRuntime OriginalMemory remote_OriginalMemory OriginalCpus remote_NodeNumber remote_SMPGranularity
ID = 8428.0
  jobstatus = 2
  CpusProvisioned = 1
  xcount = undefined
  requestcpus = 1
  OriginalCpus = 4
  remote_NodeNumber = 4
  remote_SMPGranularity = 4
  BatchRuntime = 3660
  OriginalMemory = 4321
  remote_OriginalMemory = 4321
  OriginalCpus = 4
  remote_NodeNumber = 4
  remote_SMPGranularity = 4
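
A check like this could be scripted when testing several routes. The following is a minimal Python sketch, assuming the "name = value" layout printed by condor_q -af:jln above; the helper names `parse_af_output` and `cpu_mismatch` are hypothetical, and the sample text is a trimmed copy of the output shown here.

```python
def parse_af_output(text):
    """Parse 'name = value' lines into a dict keyed by lower-cased
    attribute name. The first occurrence of a repeated attribute wins."""
    attrs = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if not sep:
            continue  # skip lines without an '=' separator
        attrs.setdefault(name.strip().lower(), value.strip())
    return attrs


def cpu_mismatch(attrs):
    """Return (requestcpus, originalcpus) when both are defined and
    differ, otherwise None."""
    req = attrs.get("requestcpus")
    orig = attrs.get("originalcpus")
    if req is None or orig is None or "undefined" in (req, orig):
        return None
    req, orig = int(req), int(orig)
    return (req, orig) if req != orig else None


SAMPLE = """ID = 8428.0
  jobstatus = 2
  CpusProvisioned = 1
  xcount = undefined
  requestcpus = 1
  OriginalCpus = 4
  remote_NodeNumber = 4
  remote_SMPGranularity = 4
"""

if __name__ == "__main__":
    print(cpu_mismatch(parse_af_output(SAMPLE)))
```

For the job above this reports that requestcpus (1) and OriginalCpus (4) disagree, which is the discrepancy in question.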


So this is where I'm puzzled:
- I would expect to see xcount = 4, but it is undefined instead.
- The running job reports CpusProvisioned = 1, which makes me think that
remote_NodeNumber = 4, remote_SMPGranularity = 4, OriginalCpus = 4
are somehow ignored.
- BatchRuntime is there, with the value set as expected (61 * 60 = 3660); however, I'm not sure of its meaning.
The HTCondor manual says: << For *batch* grid universe jobs, a limit in seconds on the job's execution time, enforced by the remote batch system. >> Who is "remote" in this context? Does that mean that condor-ce would stop the running routed job after 61 minutes? Moreover,
we have here a Vanilla universe job, on both the CE and the batch side:

[root@ce01t-htc ~]# condor_ce_q 610. -l | grep -i univer
JobUniverse = 5

[root@ce01t-htc ~]# condor_q -l 8428.0 | grep -i univer
JobUniverse = 5
Remote_JobUniverse = 5

Thanks for any comments,
Stefano




_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/


