
Re: [HTCondor-users] Condor bug matching with an offline machine



Hi Todd,

After quite some time, we have now caught this effect again, but I still have trouble understanding the course of events. Let me ask for clarification once more:

You wrote: 
"if it was HTCondor itself the machine
classads would be tagged as offline and would not be matched until
successfully woken"

So this means that a machine would first need to come online before it can be matched, right? In our pool, this does not seem to hold. I have lots of lines in the MatchLog indicating that the job was matched to an offline machine over and over again:

Matched 2331.0 x.y@z <10.10.---.---:49991?addrs=10.10.---.----49991> preempting none <10.10.---.---:51516?addrs=10.10.---.----51516> <MachineName>.domain.net (offline)

At the same time, Rooster is trying to wake up that computer, but fails. Thus the job never gets executed.

Can anyone give me another hint on how to prevent this situation? Maybe I can limit the number of times a job gets matched to the same machine, since repeated matches to the same machine suggest that something is going wrong?
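
For reference, the repeated matches can be counted from the MatchLog with a small script. This is only a sketch: the regular expression is modeled on the single log line quoted above and may need adjusting for other log formats, and the function name count_offline_matches is made up for illustration.

```python
import re
from collections import Counter

# Pattern modeled on the one MatchLog line quoted above (an assumption):
#   "Matched <cluster.proc> ... <machine name> (offline)"
OFFLINE_MATCH = re.compile(r"Matched\s+(\d+\.\d+)\s.*\s(\S+)\s+\(offline\)\s*$")


def count_offline_matches(lines):
    """Count how often each (job id, machine name) pair was matched offline."""
    counts = Counter()
    for line in lines:
        m = OFFLINE_MATCH.search(line)
        if m:
            # group(1) is the job id, group(2) the last token before "(offline)"
            counts[(m.group(1), m.group(2))] += 1
    return counts
```

Feeding it the lines of the MatchLog (for example via open(path) on the negotiator host) gives a Counter whose most_common() entries show which job and machine pairs keep recurring.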

Thanks and best regards,
Jens



-----Original Message-----
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of John M Knoeller
Sent: Thursday, March 9, 2017 22:25
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Condor bug matching with an offline machine

By default, HTCondor has a NEGOTIATOR_POST_JOB_RANK expression that prefers online machines to offline machines. So, unless you changed this knob, a job would match an offline machine only if it didn't match any online machines, or if the job itself has a rank expression that prefers the offline machine.

You could set a NEGOTIATOR_PRE_JOB_RANK expression to have it prefer online machines over offline ones regardless of what the job rank is set to. But even then, an offline machine would still get matches if there were no online machines that match that job.
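
The ordering described above can be illustrated with a toy model. This is a sketch, not HTCondor code: it assumes candidates are compared first by the pre-job rank, then by the job's own rank, then by the post-job rank, and the names Machine, pick_match and prefer_online are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    name: str
    offline: bool
    job_rank: float = 0.0  # value of the job's Rank expression for this machine


def prefer_online(m):
    """Hypothetical pre-job rank: online machines score higher than offline ones."""
    return 0.0 if m.offline else 1.0


def pick_match(machines, pre_job_rank=None):
    """Return the best-ranked candidate, or None if nothing matches."""
    def key(m):
        pre = pre_job_rank(m) if pre_job_rank else 0.0
        # Stand-in for the default post-job rank that prefers online machines.
        post = 0.0 if m.offline else 1.0
        # Tuples compare left to right: pre-job rank, job rank, post-job rank.
        return (pre, m.job_rank, post)
    return max(machines, key=key, default=None)
```

With an online machine at job rank 0 and an offline machine at job rank 10, pick_match returns the offline machine unless pre_job_rank=prefer_online is passed. If the offline machine is the only candidate, it is returned either way, which is the remaining case mentioned above.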

The only way to prevent matches entirely would be to tell the negotiator not to fetch offline ads for matchmaking, or to remove the offline machines from the collector entirely.

This is partly why we are considering having a separate list of machines (or slots) that could be sent to the schedd or negotiator to tell it to ignore certain machines for some period of time.  The only way to do this currently is to have insanely complicated negotiator fetch expressions or to have the schedd mutate the job's requirements expressions.
 
-tj

-----Original Message-----
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Jens Schmaler
Sent: Thursday, March 9, 2017 12:42 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Condor bug matching with an offline machine

Dear Todd and John,

let me add a little more background here:

Our non-dedicated machines are sometimes switched off by their owners; we do not have HTCondor configured to do that. HTCondor then treats them as "absent" at first, but we have a cron job which regularly re-advertises the absent class ads as "offline", because we want HTCondor to wake the machines up when needed.

As described by Almansour, the problem arises when wake-on-LAN is not successful for one of those machines. In spite of the fact that we have the "black hole policy" in place as you suggest, for some reason HTCondor sometimes keeps on matching the job to the same (offline) machine over and over again, although other machines would be available.
So I am wondering:

- whether it is possible that LastMatchName is somehow not updated, so that the same machine may be matched again.

- why the job is matched to an offline machine at all, even though online machines would be available.

Do you think there might be a bug here?

@John: We have often had trouble with black hole machines, so any improvement in handling them would be highly appreciated. We would definitely be interested in discussing possible changes here!

Thanks and best regards,
Jens



On 09.03.17 at 09:37, Almansour Blanco wrote:
> Hello,
> 
> The machines we have that are offline are machines that have a
> Wake-on-LAN problem. We keep them in the list so that they can be woken
> up eventually: when we change something in the BIOS, there is a chance
> that the machine wakes up. As we add more and more machines, the
> problem is bound to recur, and an offline machine will again become a
> black hole.
> 
> Best regards,
> 
> Almansour Belleh Blanco
> 
> 
> 
> -----Original Message-----
> From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of John M Knoeller
> Sent: Friday, March 3, 2017 19:23
> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> Subject: Re: [HTCondor-users] Condor bug matching with an offline machine
> 
> I'm not sure I understand, why would you put machines in the offline 
> state (instead of just removing them from the collector entirely) if 
> you don't want them to match and get revived when they are needed to 
> run jobs?
> 
> Does the fact they are in the collector but offline serve some other 
> purpose for your pool?
> 
> Also, I wonder if you would have some time to discuss a possible
> improved interface for dealing with blackhole machines. We are
> considering making tagging a machine as a blackhole a first-class
> operation in HTCondor, so that you would be able to do it without
> changing any job's requirements expression or removing the machines
> from the collector entirely.
> 
> For instance, maybe you could send a command to the schedd to tell it
> to not accept matches with a machine. Would that be an acceptable
> solution or part of a solution to your current problem?
> 
> -tj
> 
> -----Original Message-----
> From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Almansour Blanco
> Sent: Thursday, March 2, 2017 9:18 AM
> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> Subject: Re: [HTCondor-users] Condor bug matching with an offline machine
> 
> Hello,
> 
> Thank you for your response. We already have the black hole
> configuration in place, so the jobs always try another machine.
> What I think is a bug is that a job should not be matched to an
> offline machine in the first place.
> 
> Kind regards,
> 
> 
> -----Original Message-----
> From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Todd Tannenbaum
> Sent: Thursday, March 2, 2017 14:26
> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> Subject: Re: [HTCondor-users] Condor bug matching with an offline machine
> 
> On 3/2/2017 4:44 AM, Almansour Blanco wrote:
>> Hello,
>> 
>> I am using HTCondor 8.4.3 on 64-bit Windows 7.
>> 
>> I have a strange bug.
>> 
>> We have some machines that have Wake-on-LAN problems in the network.
>> 
>> What happens is, when a job is matched to one of these machines while 
>> it is offline, it tries to wake it up, which fails of course.
>> 
>> On the next negotiation cycle, the same job can be matched to this 
>> machine again, and this keeps on happening again and again.
>> 
>> As far as I know, the normal course of action when a job is matched
>> to a machine and fails is that it will always try another machine,
>> which doesn't seem to be the case here.
> 
> What is putting the machine to sleep such that it needs to be woken up
> over the LAN?  Is it some screen saver, or HTCondor itself via the 
> HIBERNATE expression in the condor_config file?  I am guessing it is a 
> screen saver or some such, as if it was HTCondor itself the machine 
> classads would be tagged as offline and would not be matched until 
> successfully woken.  So perhaps one idea is to use HTCondor for your 
> power management and let it control when to put machines offline; see
> 
> http://htcondor.org/manual/current/3_18Power_Management.html
> 
> When a job is matched to a machine and fails, it will not necessarily 
> try another machine - it may, as you observe, try the same machine 
> again.  This is something we should consider improving in a future
> release; for now, however, you can use job policy expressions in your
> job's condor_submit file to achieve this.  For an example of how to do
> this, see the HOWTO recipes, specifically
> 
> https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=AvoidingBlackHoles
>
> Hope the above helps,
> Todd
> 
> _______________________________________________ HTCondor-users mailing 
> list To unsubscribe, send a message to 
> htcondor-users-request@xxxxxxxxxxx with a subject: Unsubscribe You can 
> also unsubscribe by visiting 
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> 
> The archives can be found at: 
> https://lists.cs.wisc.edu/archive/htcondor-users/
> 