
[HTCondor-users] parallel universe: some questions about node allocation and preemption



Hello,

I am doing some tests with the parallel universe on a small test pool (4 worker nodes). I am using static slots on the WNs, and each WN has 8 cores.

I am not very optimistic about the feasibility of what I am asking for, but before giving up I would like to get confirmation from the experts. Below are my questions.

1) For an MPI job, is it possible to force the number of nodes and also balance the slot allocation between these nodes? For example, if I submit a 16-core MPI job (machine_count=16), is it possible to tell HTCondor to allocate only 2 WNs with 8 cores each? With Torque/Maui we do it with "#PBS -l nodes=2:ppn=8". We plan to migrate our parallel cluster from Torque/Maui to HTCondor; the migration is already done for the single/multi-core cluster.
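For reference, the submit file I am testing with looks roughly like this (a minimal sketch; the wrapper script and MPI program names are just placeholders):

    universe      = parallel
    executable    = openmpiscript      # wrapper that launches the MPI run (placeholder name)
    arguments     = my_mpi_program
    machine_count = 16
    request_cpus  = 1
    queue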

In the NEGOTIATOR_PRE_JOB_RANK expression I use a ranking based on a WN_ID (IP address converted to an integer) to get depth-first allocation. This works fine. But suppose now that I submit a 10-core MPI job while all the slots are idle: I will get 8 cores claimed on one WN and 2 cores on the next WN, based on the WN_ID ranking. I would prefer to have 5 cores allocated on each WN (balanced allocation) and avoid the other combinations.
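Concretely, the depth-first ranking I use is roughly this (WN_ID is the custom attribute I advertise from each startd, the IP address converted to an integer):

    # prefer filling machines in WN_ID order (depth-first allocation)
    NEGOTIATOR_PRE_JOB_RANK = WN_ID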

2) In my tests, one user (puser) submits parallel jobs and another user (vuser) submits vanilla single-core jobs. puser has higher priority than vuser. My PREEMPTION_REQUIREMENTS allows preemption of vuser's jobs. It works, but the problem is the following: suppose that 32 vuser jobs are already running; if puser submits a 2-core MPI job, all 32 of vuser's jobs will be preempted and put back in the queue. Is it possible to configure HTCondor to preempt only the required number of vanilla jobs? In my example, I would like only 2 vanilla jobs to be preempted instead of 32.

What I have observed is the following: at each negotiation cycle HTCondor preempts n slots (when possible) if the MPI job needs n slots in total and the already preempted slots have not yet finished retiring/vacating. In the end there may be n+n+n+... slots preempted, and the MPI job will use only n of them while the others stay 'Claimed Idle'.
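For context, the preemption policy I have is roughly of this form (a simplified sketch based on the usual user-priority example; my exact expression differs):

    # allow preemption when the submitting user's priority is sufficiently better
    # (numerically lower) than that of the user currently running on the slot
    PREEMPTION_REQUIREMENTS = ( RemoteUserPrio > SubmitterUserPrio * 1.2 )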

3) In relation to 2):
When the MPI job starts, the preempted but unused slots remain 'Claimed Idle' for ~10 minutes before becoming 'Unclaimed Idle' or 'Claimed Busy'. Setting 'UNUSED_CLAIM_TIMEOUT = 120' on the scheduler has no effect. Is there an explanation for that?

Thanks in advance for your help,

Christophe.

--
Christophe DIARRA
Institut de Physique Nucleaire
15 Rue Georges Clemenceau
S2I/D2I - Bat 100A - Piece A108
F91406 ORSAY Cedex
Tel:    +33 (0)1 69 15 65 60 / +33 (0)6 31 26 23 69
Fax:    +33 (0)1 69 15 64 70 / E-mail: diarra@xxxxxxxxxxxxx