
Re: [Condor-users] Windows 2008 Client running in VMware losing Condor heartbeat



On Friday, 2 September, 2011 at 9:20 AM, Greg_Sterling@xxxxxxxx wrote:
I’m running a highly virtualized environment with Windows 2003 clients. We need to begin running Windows 2008 clients, but the 2008 VMs drop out of condor_status (and thus stop accepting jobs) after sitting idle for a few hours. If you RDP into the system, it re-establishes its network connection for a few hours and then drops off again.

 

I know that there are a lot of possible issues, but I am curious if there is anything established or well documented regarding this issue.

Not much documented yet. It appears to be a not-widely-used OS in the Condor world right now. I'm doing my own troubleshooting with very large Win2k8 systems (Dell R610s as well actually, 40-core boxes) and I'm keeping notes. Hoping to do a tweaking guide when I've got it all sorted out. 

 

So the moving parts…

Dell R610 Servers deployed with Broadcom 1Gb adapters

VMware ESXi 4.1

Windows 2008 Server

Condor 7.2.4

If you can, use 7.6.x. At the very least 7.4.x. You're really going to want procd support on Win2k8 to be able to scale up the execution nodes, and the procd improved considerably between 7.2 and 7.6.


Things I have tried or am in the process of trying:

Ensure Windows Firewall isn’t causing issues (will disable)

Check Windows 2008 Power Management settings

That's the first place I'd start -- it sounds like your machines are going to sleep. 

Try running on a system that has Intel network adapters (we mostly use Broadcom)

We have tried both the 7.2.4 and 7.6.2 versions of Condor.

Ah. Well. Pity. I'd still recommend sticking with the newer version. 

Investigate Broadcom network adapter settings in VMware.

Try enabling Wake On LAN functionality in VMware.

Set VMware CPU reservation to a non-zero value (might be idling VM to 0% usage)

So at present I'm unable to make Condor 7.6.2 fully utilize a 40-core machine. I can get about 28 jobs running on the box before it starts to run into serious inter-daemon communication issues, with comm sockets being closed while the startd is trying to spawn starters and tell them to run jobs. This happens on both Win2k8 ES and Win2k8 R2 ES. One solution has been to carve the box up with VMware into 8-core machines (the max # of cores the VMware version we're using supports). I couldn't fully load the machines with 5 x 8-core VMs. I had to reserve two CPUs on the box, otherwise VMware started to fall over. Other than that it's more or less just worked as a stop-gap -- albeit an expensive one -- until I can figure out why Condor isn't happy running 40 jobs at a time on the boxes.
What I am worried about is that the VMware network driver is doing some ‘magic’ and intercepting ping and other network requests and not passing them to the VM so that it can attempt to idle/suspend the VM.
AFAIK we didn't do any special tuning on the VMware side of things with the stop-gap solution. The OS has all the power-save features flipped off though.
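
For reference, flipping the power-save features off inside the guest could be scripted roughly as below. This is only a sketch, not what we actually ran: the helper names (build_powercfg_commands, apply_commands) are made up for illustration, and it assumes the stock powercfg tool accepts these -change and -hibernate switches on your build, so check `powercfg /?` first. It does not touch the NIC's own "allow the computer to turn off this device" setting, which lives in Device Manager.

```python
import subprocess

# Timeouts to zero out on AC power; powercfg treats 0 as "never".
TIMEOUT_SETTINGS = [
    "standby-timeout-ac",
    "hibernate-timeout-ac",
    "disk-timeout-ac",
    "monitor-timeout-ac",
]


def build_powercfg_commands(settings=TIMEOUT_SETTINGS):
    """Return the powercfg command lines that set each timeout to 0."""
    commands = [["powercfg", "-change", "-" + name, "0"] for name in settings]
    # Also switch hibernation off entirely.
    commands.append(["powercfg", "-hibernate", "off"])
    return commands


def apply_commands(commands, dry_run=True):
    """Print the commands, or run them when dry_run is False (needs admin)."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.check_call(cmd)
```

Calling apply_commands(build_powercfg_commands()) only prints the commands, which makes it easy to review them before running anything with dry_run=False from an elevated prompt.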

I'm definitely interested in hearing more about your results with Win2k8. I've been battling to stabilize big Win2k8 pools for a while now.
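
If it helps pin down exactly when the VMs drop off, something like the following could be run from cron or Task Scheduler on the central manager. It is a rough sketch under assumptions: the function names and the example host names are hypothetical, and it assumes condor_status is on the PATH and that `-format "%s\n" Machine` prints one Machine attribute per slot, as it does on the versions I've used.

```python
import subprocess


def parse_machine_names(condor_status_output):
    """Collect the distinct machine names, one per non-empty output line."""
    names = set()
    for line in condor_status_output.splitlines():
        line = line.strip()
        if line:
            # Multi-slot machines repeat their name; the set removes duplicates.
            names.add(line.lower())
    return names


def missing_machines(expected, condor_status_output):
    """Return the expected hosts that are absent from the condor_status output."""
    present = parse_machine_names(condor_status_output)
    return sorted(set(h.lower() for h in expected) - present)


def query_pool():
    """Ask the collector for the Machine attribute of every slot it knows about."""
    output = subprocess.check_output(
        ["condor_status", "-format", "%s\n", "Machine"]
    )
    return output.decode()
```

For example, missing_machines(["vm01.example.com", "vm02.example.com"], query_pool()) returns the hosts that have fallen out of the pool; logging that list with a timestamp every few minutes should show whether the drop-offs line up with the idle period.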

Regards,
- Ian

---
Ian Chesal

Cycle Computing, LLC
Leader in Open Compute Solutions for Clouds, Servers, and Desktops
Enterprise Condor Support and Management Tools

http://www.cyclecomputing.com
http://www.cyclecloud.com
http://twitter.com/cyclecomputing