[Condor-users] Computers missing from Condor pool
- Date: Tue, 26 Feb 2008 15:34:39 +0100
- From: Rob de Graaf <r.degraaf@xxxxxxxxxxxx>
- Subject: [Condor-users] Computers missing from Condor pool
Hello all,
We run a Condor pool consisting of a Linux central manager and some
4,500 Windows XP execute nodes. Almost all of these have dual-core CPUs,
so on a good day we would expect to see 7,000+ virtual machines in our
pool. The problem is that we don't see that many; in fact, we only see
around 5,000 at peak hours. For a few weeks now, we've been trying to
find our "missing" computers, with little success.
Of course the first thing we did was to make sure Condor was properly
installed on all machines, and that there are no connectivity issues
preventing hosts in part of the network from connecting to the manager.
We had cron periodically parse condor_status -l for "new" host names,
building a unique list. It grew quickly, and now contains over 4,400
unique host names (each includes the machine's MAC address). This tells us that
Condor is in fact installed on all computers, and that they all can
connect to the central manager, having been in the pool at some point.
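
As a rough illustration of the cron job described above, here is a minimal Python sketch. It assumes each ad in the condor_status -l output carries a line of the form Machine = "host.name"; the function names and the seen_hosts.txt file name are made up for this example and are not what we actually run.

```python
import re
import subprocess
from pathlib import Path

# Assumed ad line format: Machine = "host.name"
MACHINE_RE = re.compile(r'^Machine\s*=\s*"([^"]+)"', re.MULTILINE)


def extract_machines(status_text):
    """Return the set of host names found in condor_status -l style text."""
    # Dual-core machines report one ad per virtual machine, so the set
    # collapses those duplicates; names are lowercased for comparison.
    return {name.lower() for name in MACHINE_RE.findall(status_text)}


def merge_into_seen(seen_file, current):
    """Add current hosts to the cumulative list on disk; return the new ones."""
    path = Path(seen_file)
    seen = set(path.read_text().split()) if path.exists() else set()
    new = current - seen
    if new:
        path.write_text("\n".join(sorted(seen | current)) + "\n")
    return new


def snapshot(seen_file="seen_hosts.txt"):
    """Run condor_status -l once and record any host names not seen before."""
    out = subprocess.run(
        ["condor_status", "-l"], capture_output=True, text=True, check=True
    ).stdout
    return merge_into_seen(seen_file, extract_machines(out))
```

Calling snapshot() from cron every few minutes would grow the list in the way described; the number of lines in the file is then the count of unique hosts ever seen.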
The next thing we did was to make sure the "missing" computers weren't
simply powered down. We conducted ping sweeps at different times and on
various parts of the network, compared the results to the condor_status
output, and we consistently found many more live hosts than were
appearing in the pool, up to twice as many at times. We concluded there
are computers that have Condor installed, have been in the pool at
some point, are powered on and responding to ping, but are not appearing
in the pool for some reason.
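
The comparison above boils down to a set difference. A minimal sketch, assuming the ping sweep results, the current condor_status host names and the cumulative "ever seen" list are each available as a collection of host names (the function name is hypothetical):

```python
def missing_from_pool(alive, in_pool, ever_seen):
    """Hosts that answer ping and were in the pool before, but are absent now."""
    alive = {h.lower() for h in alive}
    in_pool = {h.lower() for h in in_pool}
    ever_seen = {h.lower() for h in ever_seen}
    # Restricting to ever_seen excludes live devices that never ran Condor.
    return sorted((alive & ever_seen) - in_pool)
```

This only works if all three lists name hosts the same way; if the ping sweep yields IP addresses, they would first have to be mapped to the host names Condor reports.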
Our next step was to find out if the collector daemon was a bottleneck.
We created a tcpdump of traffic on the collector port, and compared it
to the actions of the collector daemon; specifically, we looked for UDP
packets containing "Command = 0" and compared them to the UPDATE_STARTD_ADS
entries logged by the collector daemon. We found that our collector is not a bottleneck;
it appears to be processing all incoming updates as expected.
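
The counting itself can be sketched as below. This assumes the tcpdump capture has been written out as text and that both files cover the same time window; the two marker strings are the ones mentioned above, everything else (function names, one match per line) is an assumption for illustration.

```python
def count_lines_containing(lines, marker):
    """Count the lines in which the marker string occurs."""
    return sum(1 for line in lines if marker in line)


def compare_update_counts(dump_lines, log_lines,
                          dump_marker="Command = 0",
                          log_marker="UPDATE_STARTD_ADS"):
    """Return (updates seen on the wire, updates logged, difference)."""
    sent = count_lines_containing(dump_lines, dump_marker)
    logged = count_lines_containing(log_lines, log_marker)
    return sent, logged, sent - logged
```

A difference near zero suggests the collector handles everything that reaches it; a large positive difference would point at updates arriving but not being processed. Note that a plain substring match on "Command = 0" would also count other command numbers that begin with 0, if any appear in the capture.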
During our analysis of traffic on the collector port, we did find that
sometimes execute nodes will not send complete updates via UDP, see:
https://lists.cs.wisc.edu/archive/condor-users/2008-January/msg00231.shtml
The suggested fix, adding a delay by setting the D_NETWORK debug flag,
has been applied on all computers and has had some effect; the average
pool size has gone up, but not by as much as we had hoped, and ping
sweeps still reveal many more live machines not appearing in the pool,
leading us to believe there is still some other problem.
We've looked at master and startd log files but we haven't been able to
find anything seriously wrong, and we're running out of ideas.
What could be causing computers to sometimes be missing from our pool,
and what else can we do to find them?
Thanks,
Rob de Graaf