
Re: [HTCondor-users] Hibernate and Cron interference (solved)



Hi Christoph,

Thanks for the great simplification of the issues and improvements, and for corroborating Charles's great findings. As for your question: sadly, at the moment there is no fix that avoids SIGKILL while keeping the host ClassAd from disappearing. We have opened a couple of tickets on our end to hopefully fix this issue, which is #2 of your suggested improvements.

As for your suggested improvement #1: as of HTCondor V10.4.0, when using partitionable slots, the partitionable slot ad (not the dynamic slot ads) contains two attributes called NumDynamicSlots and NumDynamicSlotsTime. NumDynamicSlots is pretty straightforward: it is the number of dynamic slots that currently exist. NumDynamicSlotsTime is the last time the value of NumDynamicSlots changed, whether from the creation or destruction of a dynamic slot. So, in theory, one could use these attributes to determine whether a partitionable slot has been idle for a while, rather than using the startd cron. I believe a configuration line like the one below should do the trick:
ShouldHibernate = isUndefined(NumDynamicSlots) ? False : (NumDynamicSlots == 0 && (isUndefined(NumDynamicSlotsTime) ? time() - EnteredCurrentActivity : time() - NumDynamicSlotsTime) > $(TimeToWait))

These attributes come from the use of a new feature in V10.4.0 called Startd latches which are pretty cool on their own, and feel free to let me know if I should elaborate on what is happening in the configuration line.
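For readers who want to trace the ShouldHibernate expression step by step, here is a minimal bash sketch of the same decision. It is only a model of the logic, not something HTCondor runs: the function name should_hibernate and its argument order are made up for illustration, and an empty string stands in for an undefined attribute.

```shell
#!/bin/bash
# Hypothetical helper mirroring the ShouldHibernate expression.
# Usage: should_hibernate NOW NUM_DYNAMIC_SLOTS NUM_DYNAMIC_SLOTS_TIME \
#                         ENTERED_CURRENT_ACTIVITY TIME_TO_WAIT
# Pass "" for an attribute that is undefined in the slot ad.
# Assumes at least one of the two timestamps is defined.
should_hibernate() {
    local now=$1 num_dyn=$2 num_dyn_time=$3 entered=$4 wait=$5
    local since

    # isUndefined(NumDynamicSlots) ? False : ...
    if [ -z "$num_dyn" ]; then
        echo False
        return
    fi

    # Fall back to EnteredCurrentActivity when NumDynamicSlotsTime is undefined.
    if [ -z "$num_dyn_time" ]; then
        since=$entered
    else
        since=$num_dyn_time
    fi

    # NumDynamicSlots == 0 && (time() - since) > TimeToWait
    if [ "$num_dyn" -eq 0 ] && [ $(( now - since )) -gt "$wait" ]; then
        echo True
    else
        echo False
    fi
}

# No dynamic slots for 5000 seconds, TimeToWait of 3600 seconds.
should_hibernate 10000 0 5000 "" 3600
```

The fallback branch matters on a freshly started startd, where no dynamic slot has been created yet and NumDynamicSlotsTime may not be set.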

Hope my ramblings make sense,
Cole Bollig


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Beyer, Christoph <christoph.beyer@xxxxxxx>
Sent: Wednesday, May 24, 2023 7:04 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Hibernate and Cron interference (solved)
 
Hi,

I would like to 2nd what Charles found out and documented very precisely below :)

Especially the 2nd point below is crucial to me as I run exactly into the same problem here.

I need to change the KillSignal in condor.service, as otherwise the host ClassAd disappears forever.

@Condorteam: Is there a quick fix for this other than changing to SIGKILL and thus killing the condor daemons rather rudely every time?

Here is a short summary of Charles's proposal:

Room for improvement
====================

In my opinion, there are two improvements possible to be made :

1. Provide an easier way to detect if a machine has been sitting idle for
more than some time, removing the need for the CronTask that counts
slots.

2. Improve the switch to hibernation where a full, clean system shutdown
is requested. That way, no need to kill -9 condor ! I think it works
with the current code when you suspend instead of powerdown, or if you
have different start/stop scripts. However, with the provided systemd
unit, it does not work that well :-/.

Best
christoph

--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx

----- Original Message -----
From: "Charles Goyard" <cgoyard@xxxxxxx>
To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Sent: Friday, February 10, 2023 14:48:57
Subject: Re: [HTCondor-users] Hibernate and Cron interference (solved)

Hi all,

thanks to hints and advice from this list, I was able to set up a
working hibernation configuration.

This message is a summary of the final setup and a discussion on how
things could be easier in the future ;).

Note: the topic of the discussion is misleading, since the problems we
experience have nothing to do with Condor Cron.


The context
===========

We have a VFX render farm made up of compute-only nodes and workstations
with cycle scavenging.

The changes we wanted to implement were to be able to completely power
down computers (clean shutdown), and to be able to run several jobs on a
single machine, to take advantage of the various IO waits caused by
massive threading.


On Execution Points, we have :
==============================

# Do dynamic partitioning

# This is set from 1 to 4 depending on the CPUs.
MAX_SLOTS = <%= @max_slots %>

use feature:PartitionableSlot

MODIFY_REQUEST_EXPR_REQUESTCPUS   = quantize(RequestCpus, {1})
MODIFY_REQUEST_EXPR_REQUESTMEMORY = quantize(RequestMemory, {4096})
MODIFY_REQUEST_EXPR_REQUESTDISK   = quantize(RequestDisk, {1024})

START = ( TotalSlots <= $(MAX_SLOTS) + 1 )
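As a rough illustration of what the MODIFY_REQUEST_EXPR_* lines do to a job's request, the bash sketch below rounds a value up to the next multiple of a step. It assumes that quantize() with a single-element list rounds up to a multiple of that element, which is my reading of the ClassAd function; the helper name quantize_up is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: round VALUE up to the next multiple of STEP,
# modelling quantize(VALUE, {STEP}) for positive integers.
quantize_up() {
    local value=$1 step=$2
    echo $(( (value + step - 1) / step * step ))
}

# A job asking for 5000 MB of memory is carved a slot with 8192 MB.
quantize_up 5000 4096
# A job asking for 1500 KB of disk gets 2048 KB.
quantize_up 1500 1024
```

With a step of 1, as used for RequestCpus, the value is left unchanged.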


# Wake-On-Lan and hibernation
# We figure out the WOL capability from the output of ethtool

TimeToWait = 3600
HibernateState = "S5"

SecondsMachineIdle = 0

ShouldHibernate = ( ( SecondsMachineIdle > $(TimeToWait) ) )

HIBERNATE = ifThenElse ( $(ShouldHibernate), $(HibernateState), "NONE" )
HIBERNATE_CHECK_INTERVAL = 60

# Hack to detect activity from the number of active slots.
# It increments SecondsMachineIdle as long as the number of slots is exactly 1.

use feature:StartdCronContinuous(SecondsMachineIdleUpdater,/usr/local/htcondor/update_secondsmachineidle.sh)

with update_secondsmachineidle.sh being :

#!/bin/bash
#
# This updates SecondsMachineIdle, which represents the time a machine has
# been seen as having only one slot. The idea is that if a machine has only one
# slot for a long time, it means it is unused and can be powered off.
#
# See https://www-auth.cs.wisc.edu/lists/htcondor-users/2022-December/msg00048.shtml

sleeptime=20
secondsidle=0

read -r addr < "$(condor_config_val startd_address_file)"

while true; do
    sleep "$sleeptime"
    secondsidle=$(condor_status -limit 1 -direct "$addr" -af "TotalSlots==1 ? $sleeptime + $secondsidle : 0")
    echo -e "SlotID=1\nSecondsMachineIdle=${secondsidle}\n-s1\n"
done
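The counting rule in the script can be tried without a running startd. The bash sketch below is only a simulation under that assumption: next_idle is a hypothetical helper that applies the same "TotalSlots==1 ? sleeptime + secondsidle : 0" rule to slot counts supplied by hand.

```shell
#!/bin/bash
# Hypothetical helper: one polling step of the idle counter.
# Usage: next_idle TOTAL_SLOTS SLEEPTIME SECONDSIDLE
next_idle() {
    local total_slots=$1 sleeptime=$2 secondsidle=$3
    if [ "$total_slots" -eq 1 ]; then
        # Only the partitionable slot exists: keep counting.
        echo $(( sleeptime + secondsidle ))
    else
        # At least one dynamic slot exists: reset the counter.
        echo 0
    fi
}

# Simulate three idle polls, one busy poll, then one idle poll.
idle=0
for slots in 1 1 1 3 1; do
    idle=$(next_idle "$slots" 20 "$idle")
    echo "TotalSlots=$slots SecondsMachineIdle=$idle"
done
```

With sleeptime=20 and TimeToWait = 3600, the counter has to survive more than 180 consecutive idle polls before ShouldHibernate becomes true.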


Finally, we have to kill HTCondor somewhat violently to prevent a stray
ClassAd that does not include the Hibernation information :

The condor.service unit file :

[Unit]
Description=Condor Distributed High-Throughput-Computing
After=network.target nslcd.service openntpd.service
Wants=network.target

[Service]
EnvironmentFile=-/etc/default/condor
ExecStart=/usr/sbin/condor_master -f
Delegate=true
# In the future, we will use ExecStop with a synchronous condor_off
KillMode=mixed
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure
RestartSec=1minute
WatchdogSec=20minutes
TimeoutStopSec=150seconds
StandardOutput=journal
NotifyAccess=main
# KILL instead of QUIT fixes hibernation
KillSignal=SIGKILL
# Matches values in Linux Kernel Tuning script
LimitNOFILE=32768
TasksMax=4194303

[Install]
WantedBy=multi-user.target
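If you would rather not edit the packaged unit file, the same KillSignal change can probably be carried in a systemd drop-in instead. The bash sketch below writes one; it assumes the unit is named condor.service as above, and the function name write_killsignal_override and the file name override.conf are illustrative choices.

```shell
#!/bin/bash
# Hypothetical helper: write a drop-in that overrides KillSignal for
# condor.service below the given systemd unit directory.
# Usage: write_killsignal_override UNIT_DIR
write_killsignal_override() {
    local unit_dir=$1
    local dropin_dir="$unit_dir/condor.service.d"
    mkdir -p "$dropin_dir"
    cat > "$dropin_dir/override.conf" <<'EOF'
[Service]
# KILL instead of QUIT fixes hibernation
KillSignal=SIGKILL
EOF
}

# On a real execution point, as root:
#   write_killsignal_override /etc/systemd/system
#   systemctl daemon-reload
```

A drop-in survives package upgrades that replace the shipped condor.service, so the workaround does not need to be reapplied after each update.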


On the Central Manager, we have
===============================

# Rooster wakes nodes up
DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, ROOSTER, SHARED_PORT

COLLECTOR_PERSISTENT_AD_LOG = /vol/condor/offline_ads/PersistentAdLog

ABSENT_REQUIREMENTS = ( (HibernationLevel?:0) == 0 )
EXPIRE_INVALIDATED_ADS = True
CLASSAD_LIFETIME = 900
# 604800s is 7 days
ABSENT_EXPIRE_ADS_AFTER = 604800
OFFLINE_EXPIRE_ADS_AFTER = 604800

ROOSTER_INTERVAL = 180
ROOSTER_UNHIBERNATE = ( Offline && Unhibernate )
ROOSTER_UNHIBERNATE_RANK = buf_cpuindex_avg



Things have been working well for a few days, and we were able to remove
the system cron that removed the Absent flag from ClassAds. So far so
good !


Side changes
============

We changed from UDP to TCP for communication between EPs and the CM.


Room for improvement
====================

In my opinion, there are two improvements possible to be made :

1. Provide an easier way to detect if a machine has been sitting idle for
more than some time, removing the need for the CronTask that counts
slots.

2. Improve the switch to hibernation where a full, clean system shutdown
is requested. That way, no need to kill -9 condor ! I think it works
with the current code when you suspend instead of powerdown, or if you
have different start/stop scripts. However, with the provided systemd
unit, it does not work that well :-/.


Thank you :)
============

Thanks to Todd, Todd and Christoph for their help. Kudos to the whole
Condor team for this wonderful software !
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
