Re: [Condor-users] Computers missing from Condor pool

On Mar 7, 2008, at 8:50 AM, Rob de Graaf wrote:

James wrote:
Run "condor_status -l | condor_updates_stats | grep Stats:" and
check for lost updates.

The change to the UDP buffer has decreased the percentage of lost
updates as shown by condor_updates_stats by quite a bit; mostly 0-2%
lost updates with some spikes at 10%, compared to 10-30% all round
before the buffer increase. While this is definitely an improvement,
we're still not satisfied with the number of hosts in the pool; ping
sweeps still show some 20% additional live hosts the collector doesn't
know about.

Is there anything else we could try?

You can try setting CLASSAD_LIFETIME in the config file on your central manager. The default is 900 seconds. Increasing the value makes ads less likely to disappear from the collector because of dropped UDP packets. The trade-off is that ads for crashed machines will also linger longer in the collector, so jobs may be matched to those machines and then be unable to execute for a short period of time.
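
As a minimal sketch, the entry could look something like the fragment below. The value 1800 is only an illustrative assumption (twice the 900-second default), not a tuned recommendation, and which config file you put it in depends on how your central manager is set up.

```
# Config file on the central manager (the collector host).
# Illustrative value only: twice the 900-second default, so an ad
# survives a longer run of dropped updates before it expires.
CLASSAD_LIFETIME = 1800
```

The collector has to pick up the new setting before it takes effect. A larger value trades fewer missing hosts for stale ads that stay around longer, so it is probably worth raising it in steps and rechecking the pool size each time.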

|           Jaime Frey           | I used to be a heavy gambler.     |
|       jfrey@xxxxxxxxxxx        | But now I just make mental bets.  |
| http://www.cs.wisc.edu/~jfrey/ | That's how I lost my mind.        |