
Re: [Condor-users] Matching to not responding machines



Hi Hermann,

On 03/28/2012 11:32 AM, Hermann Fuchs wrote:
> However, I would like to implement some kind of a failure detection for
> the running grid as network problems will and do occur.
> Is there a query which is only answered when the machines do
> communicate?
> condor_status seems to be misleading, the machines listed there which
> stopped communicating remain there in some cases (e.g. the mentioned
> case).

You could use INVALIDATE_STARTD_ADS (see man condor_advertise) to make the collector forget about specific machines, but you would need to know which machines to invalidate. The only way I can think of right now is to ask each machine directly (condor_status -direct, or maybe condor_config_val) and check the exit status of the command. The downside is that you have to wait out a timeout for every machine that has the problem, so with hundreds or thousands of machines this quickly becomes infeasible.
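
A rough sketch of the direct-probe idea, in Python. It runs the probes in a thread pool so that the timeouts overlap instead of adding up, which softens the scaling problem somewhat. The function names, the 10-second timeout, and the worker count are my own placeholders, and I am assuming that condor_status -direct exits non-zero when the startd cannot be reached, so check that on your pool before relying on it.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def probe(host, timeout=10):
    """Return True if `condor_status -direct <host>` exits 0 within `timeout` seconds.

    Assumption: a non-zero exit status (or a hang) means the startd on
    `host` is not answering. The 10 s default is an arbitrary placeholder.
    """
    try:
        result = subprocess.run(
            ["condor_status", "-direct", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        # Timed out, or condor_status could not be started at all.
        return False
    return result.returncode == 0


def find_unresponsive(hosts, probe_fn=probe, workers=32):
    """Probe all hosts concurrently and return those that did not answer.

    `probe_fn` is injectable so the logic can be tried without a pool.
    """
    hosts = list(hosts)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        alive = list(pool.map(probe_fn, hosts))
    return [host for host, ok in zip(hosts, alive) if not ok]
```

The list returned by find_unresponsive() is what you would then feed, one machine at a time, to condor_advertise INVALIDATE_STARTD_ADS.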

Alternatively, you could lower CLASSAD_LIFETIME on the collector so that it forgets unresponsive machines more quickly. The risk is that it may also expire the ads of working machines if some of their updates get lost on the network. See: http://research.cs.wisc.edu/condor/manual/v7.6/3_3Configuration.html#SECTION004316000000000000000
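
To get a feel for how far CLASSAD_LIFETIME can be lowered, here is a back-of-the-envelope calculation in Python. It uses a simplified model of my own: the startd sends an update every UPDATE_INTERVAL seconds, and an ad expires CLASSAD_LIFETIME seconds after the last update the collector received. The function name is hypothetical, and the values 900 and 300 are what I believe the defaults to be, so verify them with condor_config_val on your collector.

```python
def updates_before_expiry(classad_lifetime, update_interval):
    """Count update attempts that arrive strictly before the ad expires.

    Simplified model (an assumption, not Condor's exact behaviour):
    attempts are sent at update_interval, 2 * update_interval, ... seconds
    after the last received update, and the ad expires at classad_lifetime.
    If the result is N, the ad survives as long as at least one of those
    N attempts gets through.
    """
    if classad_lifetime <= 0 or update_interval <= 0:
        raise ValueError("both values must be positive numbers of seconds")
    return (classad_lifetime - 1) // update_interval


# Assumed defaults: CLASSAD_LIFETIME = 900, UPDATE_INTERVAL = 300
print(updates_before_expiry(900, 300))
# Halving the lifetime to 450 leaves a single attempt before expiry
print(updates_before_expiry(450, 300))
```

Under this model, a result of 1 means a single lost update is enough to drop a healthy machine from the collector, and a result of 0 means ads can expire even when no update is lost.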

Regards,

Rob