[Condor-users] Computers missing from Condor pool



Hello all,

We run a Condor pool consisting of a Linux central manager and about 4,500 Windows XP execute nodes. Almost all of these have dual-core CPUs, so on a good day we would expect to see 7,000+ virtual machines in our pool. The problem is that we don't see that many; in fact, we only see around 5,000 at peak hours. For a few weeks now, we've been trying to find our "missing" computers, with little success.

Of course, the first thing we did was to make sure Condor was properly installed on all machines, and that there were no connectivity issues preventing hosts in part of the network from connecting to the manager.

We had cron periodically parse the output of condor_status -l for "new" host names, building a unique list. It grew quickly, and now contains over 4,400 unique host names (each name includes the machine's MAC address). This tells us that Condor is in fact installed on all computers, and that they can all connect to the central manager, having been in the pool at some point.
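
For anyone wanting to repeat this, here is a rough sketch of the kind of parsing involved. It is not our actual cron script. It assumes each ClassAd in the condor_status -l output carries a line of the form Machine = "hostname", and the function names are made up for illustration.

```python
import re

# Assumption: each ClassAd contains a line such as
#   Machine = "pc-0011aabbccdd.example.org"
MACHINE_RE = re.compile(r'^Machine\s*=\s*"([^"]+)"\s*$', re.MULTILINE)


def extract_machines(classad_text):
    """Return the set of lower-cased host names found in the given text."""
    return {name.lower() for name in MACHINE_RE.findall(classad_text)}


def merge_seen(seen, classad_text):
    """Add newly observed hosts to `seen` and return the newcomers, sorted."""
    new = extract_machines(classad_text) - seen
    seen |= new
    return sorted(new)
```

Because the result is a set, the two virtual machines of a dual-core node count as one host. A cron job would call merge_seen with the saved set and the latest condor_status -l output, then write the set back to disk.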

The next thing we did was to make sure the "missing" computers weren't simply powered down. We conducted ping sweeps at different times and on various parts of the network, compared the results to the condor_status output, and we consistently found many more live hosts than were appearing in the pool, up to twice as many at times. We concluded there are computers that have Condor installed, have been in the pool at some point, are powered on and responding to ping, but are not appearing in the pool for some reason.
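
The comparison itself is just a set difference. A minimal sketch, assuming the ping sweep results have already been resolved to host names and that comparing the short (unqualified) name is good enough; the helper names are hypothetical:

```python
def short_name(host):
    """Normalise a host name: trim, lower-case, drop the domain part."""
    return host.strip().lower().split(".", 1)[0]


def missing_from_pool(live_hosts, pool_hosts):
    """Return hosts that answered ping but are absent from the pool, sorted."""
    pool = {short_name(h) for h in pool_hosts}
    live = {short_name(h) for h in live_hosts if h.strip()}
    return sorted(live - pool)
```

For example, missing_from_pool(["pc-a.example.org", "pc-b"], ["PC-A"]) returns ["pc-b"].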

Our next step was to find out whether the collector daemon was a bottleneck. We captured traffic on the collector port with tcpdump and compared it to the actions of the collector daemon; specifically, we looked for UDP packets containing "Command = 0" and compared them to the UPDATE_STARTD_ADS entries logged by the collector daemon. We found that our collector is not a bottleneck; it appears to be processing all incoming updates as expected.
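
In rough terms the check amounts to counting both sides and comparing. The sketch below assumes the capture has been dumped to text with one update per line and that the collector writes one log line per UPDATE_STARTD_ADS command; both are simplifications, and the function names are made up.

```python
import re

# Assumption: an update shows up in the text dump as "Command = 0".
# The \b keeps values such as "Command = 05" from matching.
CAPTURE_RE = re.compile(r"Command = 0\b")
LOG_MARKER = "UPDATE_STARTD_ADS"


def count_updates(capture_lines, log_lines):
    """Return (seen_on_wire, logged_by_collector, difference)."""
    on_wire = sum(1 for line in capture_lines if CAPTURE_RE.search(line))
    logged = sum(1 for line in log_lines if LOG_MARKER in line)
    return on_wire, logged, on_wire - logged
```

A difference near zero over the same time window suggests the collector is handling what arrives; a large positive difference would point to updates being dropped.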

During our analysis of traffic on the collector port, we did find that execute nodes sometimes fail to send complete updates via UDP. See:

https://lists.cs.wisc.edu/archive/condor-users/2008-January/msg00231.shtml

The suggested fix, adding a delay by setting the D_NETWORK debug flag, has been applied on all computers and has had some effect: the average pool size has gone up, but not by as much as we had hoped. Ping sweeps still reveal many more live machines than appear in the pool, leading us to believe there is still some other problem.

We've looked at master and startd log files but we haven't been able to find anything seriously wrong, and we're running out of ideas.

What could be causing computers to sometimes be missing from our pool, and what else can we do to find them?

Thanks,

Rob de Graaf