
[HTCondor-users] How to recover node without draining/restarting



Hi all,

we had a node in an odd state, where 'Owner' appeared as an unknown
slot type [1]. Other nodes worked fine, and the node itself was full.
We suspect a correlation with a remote filesystem that timed out. However,
after getting the remote filesystem running again and reloading the
condor config, the unknown slot did not recover. So we started to drain
and restart the node, rather than restarting condor directly, for fear
of losing all current jobs.
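
For reference, the reload and drain steps we took correspond roughly to the sketch below. It only builds and prints the command lines and does not run them. The helper names are made up for illustration, and the exact options are an assumption about a typical setup (condor_reconfig and condor_drain addressed by hostname), not a record of what was actually typed.

```python
import shlex


def reconfig_command(host):
    """Command asking the HTCondor daemons on `host` to re-read their config."""
    return ["condor_reconfig", host]


def drain_command(host, graceful=True):
    """Command that starts draining `host`.

    With graceful=True, -graceful is passed explicitly, so the configured
    retirement and vacate time policy is honoured while draining.
    """
    cmd = ["condor_drain"]
    if graceful:
        cmd.append("-graceful")
    cmd.append(host)
    return cmd


def show(cmd):
    """Render an argument list as a shell-safe string for logging or review."""
    return " ".join(shlex.quote(part) for part in cmd)


if __name__ == "__main__":
    host = "batch0943.desy.de"  # hostname taken from [1]
    for cmd in (reconfig_command(host), drain_command(host)):
        print(show(cmd))
```

Printing the commands first, instead of executing them, makes it easy to check which host is targeted before anything touches the running jobs.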

So our question is: is there a better way to recover such a node without
draining it or losing the current jobs?

Cheers and thanks,
  Thomas




[1]
slot1@xxxxxxxxxxxxxxxxx batch0943.desy.de Owner 0 32.0 Partitionable
false Problem
### UNKNOWN SlotType = "Owner"
