[HTCondor-users] freezing Condor tasks for maintenance?



Hi all,

does anybody have experience with freezing whole Condor process trees
for node maintenance?

Background: we had to do a bit of manual work on a few jobs whose mounts
needed some attention. Along the way it seemed like a good idea to freeze
the jobs, so that we could work on the mounts without affecting them,
e.g., if a mount disappears for a moment.

Playing on a test node, it seems that one can add the Condor process
tree to a freezer cgroup and hibernate it for some time without
affecting the daemons' health [1] (provided that the freeze is short
enough that the collector does not assume the node is dead).
But maybe somebody already has experience with whether this works in
real-life scenarios with user jobs, which might be more sensitive to
being frozen, or with how the system reacts if a full node reappears
with all its jobs after being absent for too long (and the jobs have
already been resubmitted)?

Ideally, frozen processes would survive a reboot, but so far my
attempts with CRIU [https://criu.org] were not very successful
(probably it works better with binaries than with shell scripts
started in an active session...?)

Cheers,
  Thomas

[1]
> mkdir /sys/fs/cgroup/freezer/mycondorfreeze/
> while read X; do echo ${X} >> /sys/fs/cgroup/freezer/mycondorfreeze/tasks; \
  done < /sys/fs/cgroup/memory/system.slice/condor.service/tasks
> cat /sys/fs/cgroup/freezer/mycondorfreeze/freezer.state
THAWED
> echo FROZEN > /sys/fs/cgroup/freezer/mycondorfreeze/freezer.state
...wait...
> echo THAWED > /sys/fs/cgroup/freezer/mycondorfreeze/freezer.state
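
As a rough sketch only (not something tested on production nodes), the steps in [1] could be wrapped in a few shell functions so that the thaw still happens if the shell is interrupted. The function names and the FREEZER_DIR and SOURCE_TASKS variables are made up for illustration; the default paths are the cgroup v1 paths from [1] and may differ on other setups.

```shell
#!/bin/sh
# Sketch only: names and defaults below are illustrative assumptions,
# following the cgroup v1 paths used in [1].
FREEZER_DIR="${FREEZER_DIR:-/sys/fs/cgroup/freezer/mycondorfreeze}"
SOURCE_TASKS="${SOURCE_TASKS:-/sys/fs/cgroup/memory/system.slice/condor.service/tasks}"

# Create the freezer cgroup if it does not exist yet.
prepare_freezer() {
    mkdir -p "${FREEZER_DIR}"
}

# Move every task ID listed in SOURCE_TASKS into the freezer cgroup.
# A task may exit between the read and the write, so failed writes are ignored.
move_tasks() {
    while read -r pid; do
        echo "${pid}" >> "${FREEZER_DIR}/tasks" 2>/dev/null || true
    done < "${SOURCE_TASKS}"
}

# Write FROZEN or THAWED to the cgroup's freezer.state file.
set_state() {
    echo "$1" > "${FREEZER_DIR}/freezer.state"
}

# Freeze for $1 seconds, then thaw; also thaw if INT or TERM arrives meanwhile.
freeze_for() {
    trap 'set_state THAWED; exit 130' INT TERM
    set_state FROZEN || { trap - INT TERM; return 1; }
    sleep "$1"
    set_state THAWED
    trap - INT TERM
}
```

Run as root, a call such as `prepare_freezer && move_tasks && freeze_for 60` would then freeze the tree for one minute; the 60 seconds is an arbitrary example and would have to stay below whatever timeout makes the collector consider the node dead.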


