Re: [HTCondor-users] Condor bug matching with an offline machine
- Date: Thu, 02 Mar 2017 07:25:57 -0600
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Condor bug matching with an offline machine
On 3/2/2017 4:44 AM, Almansour Blanco wrote:
> I am using HTCondor 8.4.3 on Windows 7 64-bit.
> I have a strange bug.
> We have some machines that have Wake-on-LAN problems on our network.
> What happens is that when a job is matched to one of these machines while
> it is offline, HTCondor tries to wake it up, which of course fails.
> On the next negotiation cycle, the same job can be matched to this
> machine again, and this keeps happening over and over.
> As far as I know, the normal behavior is that when a job is matched to
> a machine and the attempt fails, HTCondor tries another machine, which
> doesn't seem to be happening here.
What is putting the machine to sleep such that it needs to be woken up
over the LAN? Is it some screen saver, or HTCondor itself via the
HIBERNATE expression in the condor_config file? I am guessing it is a
screen saver or some such, because if it were HTCondor itself, the machine
ClassAds would be tagged as offline and would not be matched until the
machine was successfully woken. So one idea is to use HTCondor for your
power management and let it control when to put machines offline; see
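For illustration, here is a minimal sketch of letting HTCondor itself handle hibernation, assuming the usual $(HOUR) macro from the default configuration. The two-hour idle threshold, the 300-second check interval, and the helper macro name ShouldHibernate are assumptions chosen for the example, not recommended values; check the power management section of the manual for your version before using it.

```
# --- Execute machine (local config); thresholds are assumptions ---
# How often (in seconds) the startd evaluates HIBERNATE; 0 disables it.
HIBERNATE_CHECK_INTERVAL = 300

# Hypothetical helper macro: the slot is unclaimed and nobody has
# touched the keyboard or mouse for two hours.
ShouldHibernate = (State == "Unclaimed") && (KeyboardIdle > 2 * $(HOUR))

# Suspend to RAM (S3) when idle; otherwise stay awake ("NONE").
HIBERNATE = ifThenElse($(ShouldHibernate), "RAM", "NONE")

# --- Central manager ---
# condor_rooster wakes hibernating machines when there is work for them;
# its wake-up policy is controlled by ROOSTER_UNHIBERNATE.
DAEMON_LIST = $(DAEMON_LIST) ROOSTER
```

The point of this design is that when the startd puts the machine to sleep, the collector keeps an ad marked as offline, so the pool knows the machine's true state instead of treating a sleeping machine as available.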
When a job is matched to a machine and the attempt fails, HTCondor will
not necessarily try another machine; as you observe, it may try the same
machine again. This is something we should consider improving in a future
release. For now, however, you can use job policy expressions in your
job's condor_submit description file to achieve this. For an example of how
to do this, see the HOWTO recipes, specifically
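As one possible approach (not necessarily the recipe referred to above), the submit commands job_machine_attrs and job_machine_attrs_history_length record the Machine attribute of recent run attempts in the job ad as MachineAttrMachine0, MachineAttrMachine1, and so on, and the job's requirements can then exclude those machines. The executable name below is a hypothetical placeholder. Whether a failed Wake-on-LAN attempt is recorded as a run attempt is an assumption you should test in your pool before relying on it.

```
# Sketch of a submit description; my_job.exe is a hypothetical placeholder.
universe   = vanilla
executable = my_job.exe

# Record the Machine attribute of each run attempt in the job ad:
# MachineAttrMachine0 is the current attempt, MachineAttrMachine1 the
# previous one, and so on (keep the last 4).
job_machine_attrs = Machine
job_machine_attrs_history_length = 4

# Do not match the machines used on the two previous attempts.
# (=!= is true when the history attribute is still undefined.)
requirements = (target.Machine =!= MachineAttrMachine1) && \
               (target.Machine =!= MachineAttrMachine2)

queue
```

MachineAttrMachine0 is deliberately not excluded, because it refers to the machine of the current match and excluding it would make the job reject the machine it is being started on.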
Hope the above helps