Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


[Condor-users] Computers missing from Condor pool

Date: Tue, 26 Feb 2008 15:34:39 +0100
From: Rob de Graaf <r.degraaf@xxxxxxxxxxxx>
Subject: [Condor-users] Computers missing from Condor pool

Hello all,

We run a Condor pool consisting of a Linux central manager and some 4.500 Windows XP execute nodes. Almost all of these have dual core CPUs, so on a good day we would expect to see 7.000+ virtual machines in our pool. The problem is that we don't see that many, in fact we only see around 5.000 at peak hours. For a few weeks now, we've been trying to find our "missing" computers, with little success.

Of course the first thing we did was to make sure Condor was properly installed on all machines, and that there are no connectivity issues preventing hosts in part of the network from connecting to the manager.

We had cron periodically parse condor_status -l for "new" host names, building a unique list. It grew quickly, and now contains over 4.400 unique host names (they contain the MAC-address). This tells us that Condor is in fact installed on all computers, and that they all can connect to the central manager, having been in the pool at some point.
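The bookkeeping described above could look roughly like the following sketch. It is not the script we actually ran; it assumes each startd ad in the condor_status -l output carries a line of the form Machine = "hostname", and the function names are made up for illustration.

```python
import re

# Assumed ad line format: Machine = "pc-0011aabbccdd.example.org"
MACHINE_RE = re.compile(r'^Machine\s*=\s*"([^"]+)"\s*$', re.MULTILINE)


def hosts_in_status(status_text):
    """Return the set of lower-cased host names found in one status dump."""
    return {name.lower() for name in MACHINE_RE.findall(status_text)}


def update_seen(seen, status_text):
    """Add the hosts from one dump to `seen`; return the sorted new hosts."""
    current = hosts_in_status(status_text)
    new_hosts = sorted(current - seen)
    seen |= current  # grow the running list of unique hosts in place
    return new_hosts
```

In a cron job, status_text would be the captured output of condor_status -l, and `seen` would be loaded from and saved back to a file between runs. Several virtual machines on one computer share the same Machine value, so each computer is counted once.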

The next thing we did was to make sure the "missing" computers weren't simply powered down. We conducted ping sweeps at different times and on various parts of the network, compared the results to the condor_status output, and we consistently found many more live hosts than were appearing in the pool, up to twice as many at times. We concluded there are computers that have Condor installed, have been in the pool at some point, are powered on and responding to ping, but are not appearing in the pool for some reason.
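The comparison between a ping sweep and the pool contents boils down to a set difference. A minimal sketch, assuming both inputs are lists of host names (not IP addresses) and that the names may differ only in case or domain suffix; the helper names are hypothetical:

```python
def short_name(host):
    """Normalise a host name: trim, lower-case, and drop any domain suffix."""
    return host.strip().lower().split(".", 1)[0]


def missing_from_pool(live_hosts, pool_hosts):
    """Return hosts that answered the ping sweep but have no ad in the pool."""
    in_pool = {short_name(h) for h in pool_hosts}
    alive = {short_name(h) for h in live_hosts}
    return sorted(alive - in_pool)
```

Running this on sweeps taken at different times would show whether the same machines are missing every time or whether the set changes from sweep to sweep.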

Our next step was to find out if the collector daemon was a bottleneck. We created a tcpdump of traffic on the collector port, and compared it to the actions of the collector daemon; specifically we looked for UDP packets containing "Command = 0" and compared their number to the UPDATE_STARTD_ADS entries logged by the collector daemon. We found that our collector is not a bottleneck; it appears to be processing all incoming updates as expected.
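For illustration, the counting could be done along these lines. This is only a sketch under assumptions: the capture has been rendered to text so that each startd update yields exactly one line containing "Command = 0", the collector log yields one line containing UPDATE_STARTD_ADS per processed update, and both inputs cover the same time window. The function name is hypothetical.

```python
import re

# Word boundary so that e.g. "Command = 05" is not counted as command 0.
STARTD_UPDATE_ON_WIRE = re.compile(r"Command = 0\b")
LOGGED_UPDATE = "UPDATE_STARTD_ADS"


def update_gap(capture_lines, collector_log_lines):
    """Return (sent, processed, gap) for updates on the wire vs. in the log."""
    sent = sum(1 for line in capture_lines
               if STARTD_UPDATE_ON_WIRE.search(line))
    processed = sum(1 for line in collector_log_lines
                    if LOGGED_UPDATE in line)
    return sent, processed, sent - processed
```

A gap near zero over a long enough window is what led us to rule out the collector; a large positive gap would have pointed to updates being dropped on the manager side.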

During our analysis of traffic on the collector port, we did find that sometimes execute nodes will not send complete updates via UDP, see:


https://lists.cs.wisc.edu/archive/condor-users/2008-January/msg00231.shtml

The suggested fix, adding a delay by setting the D_NETWORK debug flag, has been applied on all computers and has had some effect; the average pool size has gone up, but not by as much as we had hoped, and ping sweeps still reveal many more live machines not appearing in the pool, leading us to believe there is still some other problem.

We've looked at master and startd log files but we haven't been able to find anything seriously wrong, and we're running out of ideas.

What could be causing computers to sometimes be missing from our pool, and what else can we do to find them?


Thanks,

Rob de Graaf

Follow-Ups:
- Re: [Condor-users] Computers missing from Condor pool
  - From: Erik Paulson
