
Re: [HTCondor-users] Condor bug matching with an offline machine



Hello,

The offline machines in our pool are the ones that have Wake-on-LAN problems.
We keep them in the list so that they can eventually be woken up: after we change something in the BIOS, there is a chance that the machine will wake up.
As we add more and more machines, the problem is likely to recur, and any offline machine that cannot be woken becomes a black hole.

Best regards,

Almansour Belleh Blanco



-----Original Message-----
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of John M Knoeller
Sent: Friday, March 3, 2017 19:23
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Condor bug matching with an offline machine

I'm not sure I understand. Why would you put machines in the offline state (instead of just removing them from the collector entirely) if you don't want them to match and get revived when they are needed to run jobs?

Does the fact they are in the collector but offline serve some other purpose for your pool?

Also, I wonder if you would have some time to discuss a possible improved interface for dealing with blackhole machines.
We are considering making tagging a machine as a blackhole a first-class operation in HTCondor, so that you would be able to do it without changing any job's requirements expression or removing the machines from the collector entirely.

For instance, maybe you could send a command to the schedd to tell it not to accept matches with a machine. Would that be an acceptable solution, or part of a solution, to your current problem?

-tj

-----Original Message-----
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Almansour Blanco
Sent: Thursday, March 2, 2017 9:18 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Condor bug matching with an offline machine

Hello, 

Thank you for your response.
We already have the blackhole configuration in place, so the jobs always try another machine.
What I think is a bug is that a job should not match with an offline machine in the first place.

Kind regards,


-----Original Message-----
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Todd Tannenbaum
Sent: Thursday, March 2, 2017 14:26
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Condor bug matching with an offline machine

On 3/2/2017 4:44 AM, Almansour Blanco wrote:
> Hello,
>
> I am using condor 8.4.3 on windows 7 64 bits.
>
> I have a strange bug.
>
> We have some machines that have Wake on LAN  problems in the network.
>
> What happens is, when a job is matched to one of these machines while 
> it is offline, it tries to wake it up, which fails of course.
>
> On the next negotiation cycle, the same job can be matched to this 
> machine again, and this keeps on happening again and again.
>
> The normal course of action, as far as I know, when a job is matched 
> to a machine and it fails, it will always try with another machine, 
> which doesn't seem to be the case here.

What is putting the machine to sleep such that it needs to be woken up over the LAN? Is it some screen saver, or HTCondor itself via the HIBERNATE expression in the condor_config file? I am guessing it is a screen saver or some such, because if it were HTCondor itself, the machine ClassAds would be tagged as offline and would not be matched until successfully woken. So perhaps one idea is to use HTCondor for your power management and let it control when to put machines offline; see

   http://htcondor.org/manual/current/3_18Power_Management.html
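
As a rough illustration, a condor_config fragment along these lines lets the condor_startd decide when the machine should sleep. The two-hour idle threshold, the 300-second check interval, and the ShouldHibernate macro name are assumptions chosen for the example, not values from this thread, and the exact policy should be checked against the manual page above.

   # How often (in seconds) the startd evaluates the HIBERNATE expression
   HIBERNATE_CHECK_INTERVAL = 300
   # Hypothetical policy: sleep after two hours without keyboard activity
   # while the slot is unclaimed
   ShouldHibernate = ( (KeyboardIdle > 2 * 3600) && (State == "Unclaimed") )
   # "RAM" requests suspend-to-RAM; "NONE" keeps the machine awake
   HIBERNATE = ifThenElse( $(ShouldHibernate), "RAM", "NONE" )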

When a job is matched to a machine and fails, it will not necessarily try another machine; it may, as you observe, try the same machine again. This is something we should consider improving in a future release. For now, however, you can use job policy expressions in your job's condor_submit file to achieve this. For an example of how to do this, see the HOWTO recipes, specifically

   https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=AvoidingBlackHoles
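
A minimal sketch of that kind of job policy, as it might appear in a submit file, is below. It assumes the job should avoid the two machines it ran on before its most recent one; the history length of 3 and the number of excluded machines are illustrative choices, so treat this as a starting point rather than the definitive recipe.

   # Record the Machine attribute of each machine this job runs on
   job_machine_attrs = Machine
   # Keep the last three recorded values (MachineAttrMachine0 is the most recent)
   job_machine_attrs_history_length = 3
   # Refuse to match machines that appear as earlier entries in that history
   requirements = (TARGET.Machine =!= MY.MachineAttrMachine1) && (TARGET.Machine =!= MY.MachineAttrMachine2)

The =!= operator is used instead of != so that the comparison still evaluates to true while the history attributes are undefined, as they are the first time the job is matched.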

Hope the above helps
Todd

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
