
[Condor-users] various questions about dynamic provisioning



Hi all

we are running dynamic slots on many machines (mostly 7.6.x, but we should
migrate to 7.8 soon to get the condor_defrag daemon).

But I have a couple of questions regarding dynamic slots which so far we
have not been able to solve:

*** Memory limits/RequestMemory/ImageSize
We run a user wrapper script which sets ulimits based on RequestMemory,
or, for the few machines still running static slots, based on the slot
settings. We actually use 110% of this limit. This should prevent users
from exceeding their allocated share.
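For reference, the wrapper logic is roughly the following (a minimal
sketch, not our actual script; it assumes the starter's $_CONDOR_JOB_AD
file is readable and that RequestMemory is in MiB while ulimit -v takes
KiB; the 1 GiB fallback is an invented example value):

```shell
#!/bin/sh
# Sketch of a USER_JOB_WRAPPER: read RequestMemory (MiB) from the job ad
# file that condor_starter points to via $_CONDOR_JOB_AD, then cap the
# job's virtual memory at 110% of that value.
req_mb=$(awk -F' = ' '$1 == "RequestMemory" {print $2}' "$_CONDOR_JOB_AD")
req_mb=${req_mb:-1024}                     # fallback if unset (assumption)
limit_kb=$(( req_mb * 1024 * 110 / 100 ))  # MiB -> KiB, plus 110% headroom
ulimit -v "$limit_kb"                      # cap virtual memory for the job
exec "$@"                                  # hand off to the actual job
```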

However, we face the problem that if a job fails and goes back to the
queue with an ImageSize larger than RequestMemory, the negotiator will
still match the job to a host; the target machine will partition off the
wanted slot, but will then fail to start the job because the ImageSize
is too large. The only hint we see is something along the lines of "Job
requirements not met" in the StarterLog.slot1_x.

The user only sees the job being idle, with condor_q -better-analyze
telling her that many machines potentially match the job.

Is there a way to either tell the user what the problem is, or to change
the way we request memory? E.g. set RequestMemory to the maximum of
0.0011 * ImageSize (roughly 110% of the image size, given that ImageSize
is in KiB and RequestMemory in MiB) and a statically given number?
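Something along these lines might work via the schedd's default-request
knob (a sketch only, assuming 7.8's JOB_DEFAULT_REQUESTMEMORY config
macro; the 2048 MiB floor is an invented example value, and
0.0011 ~ 1.1/1024, i.e. 110% of the image size converted from KiB to MiB):

```
# condor_config on the submit machine (sketch)
JOB_DEFAULT_REQUESTMEMORY = ifThenElse(ImageSize =!= undefined && \
    ImageSize * 0.0011 > 2048, ImageSize * 0.0011, 2048)
```

This only supplies a default when the user does not set request_memory
herself, so jobs with an explicit request are unaffected.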

*** How to schedule a limited amount of jobs per execute node

In the good ol' days with static slots, you could add a requirement on
the slot number if you had jobs which were really heavy on the local
scratch disk. However, with dynamic slots something like

Requirements = regexp("slot1_1@.*", RemoteHost)

no longer works, for obvious reasons (the dynamic slot has not yet been
split off while the job is being negotiated).

Is there a way to achieve this? I've looked at concurrency limits but so
far have failed to find a good way to utilize them for this scenario.
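For reference, concurrency limits would look something like the sketch
below (names SCRATCH_IO_LIMIT/scratch_io are invented examples). The
catch, as far as I understand, is that they are enforced pool-wide by
the negotiator, not per execute node, which is exactly why they don't
quite fit this use case:

```
# Negotiator config: allow at most 20 such jobs in the whole pool
SCRATCH_IO_LIMIT = 20

# Submit file: declare that the job consumes one unit of that limit
concurrency_limits = scratch_io
```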

Thanks a bunch in advance!

Carsten


PS: Inconsistency in the condor manual:

http://research.cs.wisc.edu/condor/manual/v7.8/3_12Setting_Up.html#SECTION004128900000000000000

talks about request_{cpus,memory,disk}

while the underscore is missing here:

http://research.cs.wisc.edu/condor/manual/v7.8/11_Appendix_A.html#85420


-- 
Dr. Carsten Aulbert - Max Planck Institute for Gravitational Physics
Callinstrasse 38, 30167 Hannover, Germany
phone/fax: +49 511 762-17185 / -17193
https://wiki.atlas.aei.uni-hannover.de/foswiki/bin/view/ATLAS/WebHome