
[HTCondor-users] HTCondor-CE: Setting Default limits



Hello,

I'm configuring an HTCondor-CE 5.1 and have a few questions about how to properly set default job limits.

I'm following the examples from here:
https://htcondor.github.io/htcondor-ce/v5/configuration/writing-job-routes/
such as this one:
JOB_ROUTER_ROUTE_Condor_Pool @=jrt
  UNIVERSE VANILLA
  # Set the requested memory to 1 GB
  default_maxMemory = 1000
@jrt

JOB_ROUTER_ROUTE_NAMES = Condor_Pool


Q1: Is it possible to set default_maxMemory to a value proportional to the RequestCpus of the incoming job? For example,
something like:

default_maxMemory = $(RequestCpus:1) * 3000
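
To be explicit about the rule I have in mind, here is a minimal Python sketch of the arithmetic only (this is not job route syntax). The function name default_max_memory and the mb_per_cpu parameter are made up for illustration; the fallback to 1 CPU mirrors the ":1" default in the expression above.

```python
def default_max_memory(request_cpus=None, mb_per_cpu=3000):
    """Return the desired default memory (in MB) for an incoming job.

    request_cpus: the job's RequestCpus, or None when the job does not
    set it (assumed here to fall back to 1 CPU).
    mb_per_cpu: hypothetical per-core memory allowance, in MB.
    """
    cpus = 1 if request_cpus is None else request_cpus
    return cpus * mb_per_cpu
```

With this rule a job that does not set RequestCpus would get 3000 MB, and a 4-core job would get 12000 MB.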

Q2: I applied the following defaults:

JOB_ROUTER_ROUTE_t1_defaults @=jrt
  UNIVERSE VANILLA
  default_xcount = 4
  default_maxMemory = 4321
  default_maxWallTime = 61
@jrt


But I'm a bit confused by the overall results:

0) I submit a minimal test job:
[sdalpra@ui-htc htjobs]$ condor_submit -pool ce01t-htc.cr.cnaf.infn.it:9619 -remote ce01t-htc.cr.cnaf.infn.it ce_testp308.sub
Submitting job(s).
1 job(s) submitted to cluster 610.

1) The job is routed
[root@ce01t-htc ~]# condor_ce_q 610. -af routedtojobid
8428.0

2) I check the ClassAd attributes of the routed job

[root@ce01t-htc ~]# condor_q 8428.0 -af:jln jobstatus CpusProvisioned xcount requestcpus OriginalCpus remote_NodeNumber remote_SMPGranularity BatchRuntime OriginalMemory remote_OriginalMemory OriginalCpus remote_NodeNumber remote_SMPGranularity
ID = 8428.0
jobstatus = 2
CpusProvisioned = 1
xcount = undefined
requestcpus = 1
OriginalCpus = 4
remote_NodeNumber = 4
remote_SMPGranularity = 4
BatchRuntime = 3660
OriginalMemory = 4321
remote_OriginalMemory = 4321
OriginalCpus = 4
remote_NodeNumber = 4
remote_SMPGranularity = 4


So this is where I'm puzzled:
- I would expect to see xcount = 4, but it is undefined instead.
- The running job reports CpusProvisioned = 1, which makes me think that
remote_NodeNumber = 4, remote_SMPGranularity = 4, OriginalCpus = 4
are somehow ignored.
- BatchRuntime is there, with the value I expected (61 * 60), but I'm not sure what it means.
The HTCondor manual says: << For batch grid universe jobs, a limit in seconds on the job's execution time, enforced by the remote batch system. >> Who is "remote" in this context? Does that mean that HTCondor-CE would stop the running routed job after 61 minutes? Moreover,
we have here a vanilla universe job on both the CE and the batch side:

[root@ce01t-htc ~]# condor_ce_q 610. -l | grep -i univer
JobUniverse = 5

[root@ce01t-htc ~]# condor_q -l 8428.0 | grep -i univer
JobUniverse = 5
Remote_JobUniverse = 5

Thanks for any comments,
Stefano