
Re: [HTCondor-users] HTCondor-CE: Setting Default limits



Hi Stefano,

For Q1, the quantize() ClassAd function might be useful:

set_MyDefaultMemPerCore = 3000
set_MyMemScaling = xcount * MyDefaultMemPerCore
set_TmpScaledMem = quantize(RequestMemory,MyMemScaling)

I am unsure, though, whether it would handle high-memory jobs reasonably. It might be necessary to do the reverse and scale the core count up if the original memory-per-core request exceeds your defaults.
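
To make the arithmetic concrete, here is a small Python sketch of what the quantize() expression above would yield. It only models the scalar case of quantize() (round up to the next multiple of the second argument). The helper names scaled_memory and cores_for_memory are made up for illustration and are not HTCondor attributes, and the reverse approach is just one possible reading of "scale the core count up".

```python
import math

MEM_PER_CORE = 3000  # MB, the MyDefaultMemPerCore value from above


def quantize(value, step):
    """Scalar case of ClassAd quantize(): smallest multiple of step >= value."""
    return math.ceil(value / step) * step


def scaled_memory(request_memory, xcount):
    # Mirrors quantize(RequestMemory, xcount * MyDefaultMemPerCore)
    return quantize(request_memory, xcount * MEM_PER_CORE)


def cores_for_memory(request_memory, xcount):
    # Hypothetical reverse approach: raise the core count until the
    # memory per core fits within the default.
    return max(xcount, math.ceil(request_memory / MEM_PER_CORE))


print(scaled_memory(2048, 4))      # 12000
print(scaled_memory(13000, 4))     # 24000
print(cores_for_memory(13000, 4))  # 5
```

The second call shows the high-memory concern: a 4-core job asking for 13000 MB is rounded up to 24000 MB, whereas scaling the cores instead would give 5 cores at the default memory per core.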

---

For Q2, my interpretation is that xcount is reflected in
  OriginalCpus = 4
since, AFAIK, the xcount attribute is CE-internal only and gets copied over to the RequestCpus and OriginalCpus attributes.

But maybe you can check whether your route actually got applied to your job?
E.g., we set a few defaults with [1]. Note that the ad is added to JOB_ROUTER_DEFAULTS (the route has not been touched since CE4 and is in the []-syntax).

For specific routes like [2], it might be best for testing to always include a Requirements expression to distinguish which route a job takes, and to add the route to JOB_ROUTER_ROUTE_NAMES/JOB_ROUTER_ENTRIES. I also prefer adding a 'tag' like "DESYROUTEPRIO" to routes so that I can more easily identify where a job went.

Cheers,
  Thomas


[1]
MERGE_JOB_ROUTER_DEFAULT_ADS=True
DESYDEFAULTS @=end
[
 set_DESYDEFAULTSSET =  True;

 set_default_xcount = 1;
 set_default_maxWallTime = 5760;
 set_default_maxMemory = 2048;

 set_requirements= ...
]
@end

JOB_ROUTER_DEFAULTS = $(JOB_ROUTER_DEFAULTS) $(DESYDEFAULTS)



[2]
DESYPRIO @=end
[
  TargetUniverse = 5;
  name = "DESYPRIO";
  set_DESYROUTEPRIO = True;

  Requirements = x509UserProxyVOName =?= "ops" ... ;

  # some more ads

]
@end

JOB_ROUTER_ENTRIES = $(JOB_ROUTER_ENTRIES) $(DESYPRIO)
JOB_ROUTE_NAMES = $(JOB_ROUTE_NAMES) $(DESYPRIO)


On 31/08/2021 18.08, Stefano Dal Pra wrote:
Hello,

I'm working on configuring HTCondor-CE 5.1 and have a few doubts about how to properly set default job limits.

I'm following the examples from here:
https://htcondor.github.io/htcondor-ce/v5/configuration/writing-job-routes/
such as this one:

JOB_ROUTER_ROUTE_Condor_Pool @=jrt
  UNIVERSE VANILLA
  # Set the requested memory to 1 GB
  default_maxMemory = 1000
@jrt

JOB_ROUTER_ROUTE_NAMES = Condor_Pool


Q1: Is it possible to set default_maxMemory to a value proportional to the RequestCpus of the incoming job, i.e. something like

default_maxMemory = $(RequestCpus:1) * 3000

Q2: I applied the following defaults:

JOB_ROUTER_ROUTE_t1_defaults @=jrt
   UNIVERSE VANILLA
   default_xcount = 4
   default_maxMemory = 4321
   default_maxWallTime = 61
@jrt


But I'm a bit confused by the overall results:

0) I submit a minimal test job:
[sdalpra@ui-htc htjobs]$ condor_submit -pool ce01t-htc.cr.cnaf.infn.it:9619 -remote ce01t-htc.cr.cnaf.infn.it ce_testp308.sub
Submitting job(s).
1 job(s) submitted to cluster 610.

1) The job is routed
[root@ce01t-htc ~]# condor_ce_q 610. -af routedtojobid
8428.0

2) I check classads from the routed job

[root@ce01t-htc ~]# condor_q 8428.0 -af:jln jobstatus CpusProvisioned xcount requestcpus OriginalCpus remote_NodeNumber remote_SMPGranularity BatchRuntime OriginalMemory remote_OriginalMemory OriginalCpus remote_NodeNumber remote_SMPGranularity
ID = 8428.0
 jobstatus = 2
 CpusProvisioned = 1
 xcount = undefined
 requestcpus = 1
 OriginalCpus = 4
 remote_NodeNumber = 4
 remote_SMPGranularity = 4
 BatchRuntime = 3660
 OriginalMemory = 4321
 remote_OriginalMemory = 4321
 OriginalCpus = 4
 remote_NodeNumber = 4
 remote_SMPGranularity = 4
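
A check like the one above can also be done mechanically. The following is a minimal Python sketch, assuming the "name = value" layout that condor_q -af:jln printed above. The function names parse_af_output and cpu_mismatch are hypothetical helpers, not part of any HTCondor tool, and the attribute names are matched exactly as they were typed on the command line.

```python
def parse_af_output(text):
    """Parse 'name = value' lines (as printed by condor_q -af:jln) into a dict."""
    ad = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep:  # skip lines without an '=' sign
            ad[name.strip()] = value.strip()
    return ad


def cpu_mismatch(ad):
    """Return (requested, original) if the two values differ, else None."""
    requested = ad.get("requestcpus")
    original = ad.get("OriginalCpus")
    if requested != original:
        return requested, original
    return None


# Excerpt of the output shown above
sample = """ID = 8428.0
 xcount = undefined
 requestcpus = 1
 OriginalCpus = 4
"""

ad = parse_af_output(sample)
print(cpu_mismatch(ad))  # ('1', '4')
```

Values are kept as strings on purpose, since an attribute can come back as "undefined" rather than a number.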


So this is where I'm puzzled:
- I would expect to see xcount = 4, but it is undefined instead.
- The running job reports CpusProvisioned = 1, which makes me think that
remote_NodeNumber = 4, remote_SMPGranularity = 4, OriginalCpus = 4
are somehow ignored.
- BatchRuntime is there, with the expected value (61 * 60), but I'm not sure of its meaning. The HTCondor manual says: << For *batch* grid universe jobs, a limit in seconds on the job's execution time, enforced by the remote batch system. >> Who is "remote" in this context? Does that mean that condor-ce would stop the running routed job after 61 minutes? Moreover,
we have here a vanilla universe job, on both the CE and the batch side:

[root@ce01t-htc ~]# condor_ce_q 610. -l | grep -i univer
JobUniverse = 5

[root@ce01t-htc ~]# condor_q -l 8428.0 | grep -i univer
JobUniverse = 5
Remote_JobUniverse = 5

Thanks for any comment
Stefano


