
[HTCondor-users] Defrag shall not preempt jobs



Dear list,

I just stumbled over an increased failure rate of ATLAS jobs at our
site. ATLAS runs a mixture of single-core and multi-core jobs. To keep
the multi-core jobs from starving, we run condor_defrag.

It looks like condor_defrag is evicting single-core jobs, giving them
only MaxVacateTime to finish (DEFRAG_DRAINING_SCHEDULE = graceful):

10/18/20 19:19:53 slot1_2[33437.0]: max vacate time expired.  Escalating to a fast shutdown of the job.
10/18/20 19:19:53 slot1_1[74229.0]: max vacate time expired.  Escalating to a fast shutdown of the job.

However, this is not what we want: it actually kills running jobs here.

There is probably a knob for this, but which one do I need to turn so
that the (partitionable) slot is simply drained until enough resources
for the usual eight-core jobs are free, without actively vacating
running jobs from the chosen machine?
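
For reference, here is a minimal sketch of the relevant configuration as
described above. Only the DEFRAG_DRAINING_SCHEDULE value is taken from
our setup. The DAEMON_LIST line and the MachineMaxVacateTime value (the
documented default of 600 seconds) are assumptions for illustration, not
a copy of our actual config:

```
# Sketch only: apart from DEFRAG_DRAINING_SCHEDULE, these lines are assumptions.

# Assumed: condor_defrag is enabled by adding it to the daemon list.
DAEMON_LIST = $(DAEMON_LIST) DEFRAG

# As stated above: draining uses the graceful schedule.
DEFRAG_DRAINING_SCHEDULE = graceful

# Assumed: documented default, in seconds. This is the limit that the
# "max vacate time expired" messages in the log refer to.
MachineMaxVacateTime = 600
```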

Thanks,
Andreas
-- 
| Andreas Haupt            | E-Mail: andreas.haupt@xxxxxxx
|  DESY Zeuthen            | WWW:    http://www-zeuthen.desy.de/~ahaupt
|  Platanenallee 6         | Phone:  +49/33762/7-7359
|  D-15738 Zeuthen         | Fax:    +49/33762/7-7216
