
Re: [HTCondor-users] preemption problems on particular execution nodes



On Wed, Nov 06, 2019 at 12:14:52PM +0100, Henning Fehrmann wrote:
> Hi,
> 
> we currently have HTCondor v 8.6.5 installed. Our sandbox system has

Sorry, that should read HTCondor v 8.8.5.

> - one submit host
> - one master node working as the collector and negotiator
> - and two different execution nodes a3010 (128 cores) and a3001 (32 cores)
>   with similar configurations.
> 
> We use partitionable slots on the execution nodes:
> SLOT_TYPE_1 = ram=438065, swap=0%, cpus=100%
> NUM_SLOTS_TYPE_1 = 1
> SLOT_TYPE_1_PARTITIONABLE = True
> 
> Additionally, we have two users, 1 and 2, with very different effective
> user priorities (EUP_1 >> EUP_2), and we want to test preemption.
> 
> However, it does not always work.
> 
> I)
> We perform the following experiment:
> - User 1 starts single-core jobs, which are scheduled on node a3010.
> - All jobs are running.
> - User 2, who has a much better EUP, submits multicore jobs
>   which ought to preempt the running jobs of user 1 on node a3010.
> 
> However, even though ALLOW_PSLOT_PREEMPTION = True, this does not happen.
> 
> II)
> We perform the same experiment on node a3001 and see a different result.
> The jobs of user 1 are now being preempted and the slots are occupied
> by the jobs of user 2.
> 
> The negotiator logs with D_FULLDEBUG for experiment I) can be
> downloaded here:
> https://www.atlas.aei.uni-hannover.de/~fehrmann/Condor/NegotiatorLog_a3010.gz
> 
> for experiment II):
> https://www.atlas.aei.uni-hannover.de/~fehrmann/Condor/NegotiatorLog_a3001.gz
> 
> The configuration of the collector and negotiator node:
> https://www.atlas.aei.uni-hannover.de/~fehrmann/Condor/negotiator_collector.txt.gz
> 
> The configuration of the schedd:
> https://www.atlas.aei.uni-hannover.de/~fehrmann/Condor/schedd.txt.gz
> 
> The configuration of execution node a3010:
> https://www.atlas.aei.uni-hannover.de/~fehrmann/Condor/startd_a3010.txt.gz
> 
> The configuration of execution node a3001:
> https://www.atlas.aei.uni-hannover.de/~fehrmann/Condor/startd_a3001.txt.gz
> 
> A negotiator config snippet with what we think is important:
> 
> PREEMPTION_REQUIREMENTS = True
> ALLOW_PSLOT_PREEMPTION = True
> PREEMPTION_RANK = (RemoteUserPrio * 1000000) - ifThenElse(isUndefined(TotalJobRunTime), 0, TotalJobRunTime)
> NEGOTIATOR_CONSIDER_EARLY_PREEMPTION = True
> NEGOTIATOR_CONSIDER_PREEMPTION = true
> 
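To make the PREEMPTION_RANK expression above easier to reason about, here is a minimal Python sketch of the same arithmetic. It is only an illustration and not HTCondor code: the function name, the slot names, and the priority and run-time values are hypothetical, and it assumes that the negotiator prefers the candidate with the highest PREEMPTION_RANK.

```python
from typing import Optional


def preemption_rank(remote_user_prio: float,
                    total_job_run_time: Optional[float]) -> float:
    """Mirror of the configured expression:
    (RemoteUserPrio * 1000000)
      - ifThenElse(isUndefined(TotalJobRunTime), 0, TotalJobRunTime)
    None stands in for an undefined ClassAd attribute.
    """
    run_time = 0.0 if total_job_run_time is None else total_job_run_time
    return remote_user_prio * 1_000_000 - run_time


# Hypothetical candidate slots; all values are made up for illustration.
candidates = [
    {"name": "slot1_1", "RemoteUserPrio": 500.0, "TotalJobRunTime": 3600},
    {"name": "slot1_2", "RemoteUserPrio": 500.0, "TotalJobRunTime": 60},
    {"name": "slot1_3", "RemoteUserPrio": 2.0, "TotalJobRunTime": None},
]

# Assumption: a higher rank means "preferred for preemption".
ranked = sorted(
    candidates,
    key=lambda c: preemption_rank(c["RemoteUserPrio"], c["TotalJobRunTime"]),
    reverse=True,
)

for c in ranked:
    rank = preemption_rank(c["RemoteUserPrio"], c["TotalJobRunTime"])
    print(c["name"], rank)
```

Under these assumptions, the expression favors preempting jobs of the user with the numerically larger (worse) priority first and, among those, the jobs that have run for the shortest time.
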
> Interestingly, if we set ALLOW_PSLOT_PREEMPTION to False,
> scenario I works, provided user 2 submits with request_cpus = 1.
> 
> Thank you in advance for feedback.
> 
> Cheers,
> the Atlas team.