RE: [Condor-users] Explaining the Claimed + Idle state



> > On Mon, Feb 07, 2005 at 03:25:21PM -0500, Ian Chesal wrote:
> > > I'm seeing a fair number of VM's in my system reporting "Claimed + Idle"
> > > for a long, long period of time. What can bring about this state?
> > > There are no starters on these machines. Condor does not appear to be
> > > actually running anything. Yet they are claimed and idling and not
> > > doing any work.
> > > 
> > 
> > A condor_status -l to one of those machines should show what schedd 
> > has it claimed - it will be the ClientMachine attribute. I would be 
> > curious what the schedd is doing.
> 
> They are all claimed by the same machine ttc-eahmed3 -- and 
> this machine is showing LOTS of condor_write errors in its 
> SchedLog -- I've restarted condor on the machine (with net 
> stop/net start). No condor_write errors in the last 5 
> minutes, but I'm not holding my breath. This problem is far 
> from solved with a reboot.
> 
> This is the third machine at our site to get this "plague" of 
> condor_write errors for the schedd. It's no longer isolated 
> to two machines in two cubicles. See condor-admin bug report 
> #11869. This is beginning to worry me greatly. These machines 
> have full network connectivity. No dropped pings to these 
> machines. Condor just can't seem to keep an open port. 
> Happens on both Windows XP and Linux machines.
> They are all running 6.7.3 with SEC_DEFAULT_NEGOTIATION = NEVER
> set to stop the condor_startd memory leak bug in 6.7.3.
> Could this be the problem?
> 
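
For anyone who wants to run the check suggested above across a whole pool
rather than one machine at a time, here is a minimal sketch. It assumes the
text comes from "condor_status -l" (one "Attribute = value" pair per line,
ads separated by blank lines) and that each ad carries the State, Activity,
Name and ClientMachine attributes. The function names and the sample host
names are made up for illustration; this is not a Condor tool.

```python
from collections import defaultdict


def parse_classads(text):
    """Split long-format ClassAd text into one dict per ad.

    Ads are assumed to be separated by blank lines, with one
    'Attribute = value' pair per line.
    """
    ads, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if current:
                ads.append(current)
                current = {}
            continue
        name, sep, value = line.partition("=")
        if sep:
            # Drop surrounding whitespace and the quotes around string values.
            current[name.strip()] = value.strip().strip('"')
    if current:
        ads.append(current)
    return ads


def claimed_idle_by_client(text):
    """Group Claimed + Idle VMs by the schedd machine holding the claim."""
    by_client = defaultdict(list)
    for ad in parse_classads(text):
        if ad.get("State") == "Claimed" and ad.get("Activity") == "Idle":
            client = ad.get("ClientMachine", "(unknown)")
            by_client[client].append(ad.get("Name", "(unnamed)"))
    return dict(by_client)


# Hypothetical sample input; the host names are invented.
SAMPLE = """Name = "vm1@exec-a.example.com"
State = "Claimed"
Activity = "Idle"
ClientMachine = "submit-1.example.com"

Name = "vm2@exec-a.example.com"
State = "Claimed"
Activity = "Busy"
ClientMachine = "submit-1.example.com"

Name = "vm1@exec-b.example.com"
State = "Unclaimed"
Activity = "Idle"
"""

if __name__ == "__main__":
    for client, vms in claimed_idle_by_client(SAMPLE).items():
        print(client, len(vms), ", ".join(vms))
```

To use it against a live pool, pass the captured output of
"condor_status -l" in place of SAMPLE. If one submit machine holds most of
the Claimed + Idle VMs, that is the schedd to look at first.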

More updates. I have a newly claimed machine that was claimed by
ttc-eahmed3. It's in the Claimed + Idle state, and the time in that state
has climbed to 10 minutes. I logged in and checked the starterlog.vm2 log
for this machine and I'm seeing:

2/7 15:52:28 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
2/7 15:52:36 ******************************************************
2/7 15:52:36 ** condor_starter (CONDOR_STARTER) STARTING UP
2/7 15:52:36 ** d:\abc\condor\bin\condor_starter.exe
2/7 15:52:36 ** $CondorVersion: 6.7.3 Dec 28 2004 $
2/7 15:52:36 ** $CondorPlatform: INTEL-WINNT40 $
2/7 15:52:36 ** PID = 148
2/7 15:52:36 ******************************************************
2/7 15:52:36 Using config file: d:\abc\condor\condor_config
2/7 15:52:36 Using local config files: d:\abc\condor\local.TTC-BS3066-183\condor_config.local
2/7 15:52:36 DaemonCore: Command Socket at <137.57.176.183:1324>
2/7 15:52:36 Setting resource limits not implemented!
2/7 15:52:36 Communicating with shadow <137.57.142.131:2377>
2/7 15:52:36 Submitting machine is "TTC-EAHMED3.altera.com"
2/7 15:52:36 Error setting password on account condor-reuse-vm2
2/7 15:52:36 LogonUser(condor-reuse-vm2, ... ) failed with status 1326
2/7 15:52:36 ERROR "Failed to create a user nobody" at line 332 in file ..\src\condor_c++_util\uids.C
2/7 15:52:36 ShutdownFast all jobs.

If that info helps.
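
In case it's useful for checking other machines' starter logs for the same
failure, here is a rough sketch that pulls out the error lines and any
"failed with status N" codes. It assumes one log entry per line (the wrapped
lines from the excerpt above are rejoined in the sample), and the function
name is made up for illustration.

```python
import re

# Matches the numeric code in messages such as
# "LogonUser(...) failed with status 1326".
STATUS_RE = re.compile(r"failed with status\s+(\d+)")


def find_failures(log_text):
    """Return (line, status) pairs for lines that look like failures.

    status is the integer after 'failed with status', or None when the
    line mentions an error but carries no status code.
    """
    hits = []
    for line in log_text.splitlines():
        match = STATUS_RE.search(line)
        if match:
            hits.append((line.strip(), int(match.group(1))))
        elif "error" in line.lower():
            hits.append((line.strip(), None))
    return hits


# Sample taken from the starter log excerpt, with wrapped lines rejoined.
SAMPLE = r"""2/7 15:52:36 Submitting machine is "TTC-EAHMED3.altera.com"
2/7 15:52:36 Error setting password on account condor-reuse-vm2
2/7 15:52:36 LogonUser(condor-reuse-vm2, ... ) failed with status 1326
2/7 15:52:36 ERROR "Failed to create a user nobody" at line 332 in file ..\src\condor_c++_util\uids.C
2/7 15:52:36 ShutdownFast all jobs.
"""

if __name__ == "__main__":
    for line, status in find_failures(SAMPLE):
        print(status, "|", line)
```

Status 1326 is the Win32 error code ERROR_LOGON_FAILURE (bad user name or
password), which fits with the "Error setting password" line just before it.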

- Ian