
Re: [Condor-users] VMs being cleaned up/removed




I am running Condor version 6.7.2 on Scientific Linux 3.0.3
with 11 dual-cpu worker nodes with 4 VMs each.  There are three schedulers
and the CM is using kerberos authentication.

I notice that fairly often, VMs will be "cleaned up" during housecleaning

Try a newer version of Condor.

Condor 6.7.3 has:

This release contains all the bug fixes from the 6.6 stable series up to and including version 6.6.7, and some of the fixes that will be included in version 6.6.8. The bug fixes in version 6.6.8 that were not included in version 6.7.3 are listed in a separate section of the 6.6.8 version history.


Condor 6.6.8 has:
Fixed issues that would cause condor_startd to "disappear" from the pool because of dropped machine ad updates. This fix applies to all platforms, but the symptoms were exhibited predominantly on Windows machines.

And this is one of the bug fixes included in 6.7.3.

So there is a decent shot that this problem will be fixed by upgrading to Condor 6.7.6, which is the most recent Condor release in the 6.7.x series.

The condor_startd advertises each virtual machine by sending a UDP update to the collector. On some busy networks, these updates can be lost, and a machine whose updates stop arriving eventually drops out of the pool. If upgrading doesn't work for you, you can tell Condor to use TCP instead. We don't make this the default, in order to avoid having hundreds of simultaneous open TCP connections on large pools, but it's certainly reasonable for your small pool. You can learn how to configure this in the manual:

http://www.cs.wisc.edu/condor/manual/v6.7/3_11Setting_Up.html#sec:tcp-collector-update

Basically, you set "UPDATE_COLLECTOR_WITH_TCP = TRUE" in your config file.
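
As a rough sketch of what that might look like in the config file: the first setting is the one quoted above. The second is the collector-side socket cache that, as far as I recall, the manual section linked above describes alongside it. The value 100 is only an illustrative assumption for a pool of this size, so please check the setting name and a sensible value against the manual for your version.

```
# On the machines that report to the collector (worker nodes, schedulers):
# send ClassAd updates over TCP instead of UDP.
UPDATE_COLLECTOR_WITH_TCP = TRUE

# On the central manager: let the collector keep a cache of open TCP
# sockets for incoming updates. The value here is an assumed example,
# not a recommendation taken from the manual.
COLLECTOR_SOCKET_CACHE_SIZE = 100
```

After editing, running "condor_config_val UPDATE_COLLECTOR_WITH_TCP" on a worker node should show whether the setting was picked up, and "condor_reconfig" asks the daemons to re-read their configuration. The collector may need a restart rather than a reconfig for the cache setting to take effect.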

I hope this helps. If it doesn't, please do let us know. It's not a feature that machines disappear from your pool!

-alain