[Condor-users] "Claimed Idle" state on XP execute nodes, sched still thinks they're running



Hi

We have a couple of flocked pools, with the submit machine in one pool and the XP execute boxes in the other. Jobs are successfully matched and started, but after a while all of them go into a "Claimed Idle" state on the execute nodes, while condor_q on the submit node still reports them as running. Is there a way to recover from this state automatically? For example, here's a snippet from the StartLog on one of the XP nodes at the point where it changes state:

2/3 19:58:26 DaemonCore: Command received via UDP from host <172.24.116.233:9500>
2/3 19:58:26 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
2/3 19:58:26 Starter pid 3008 exited with status 4
2/3 19:58:26 State change: starter exited
2/3 19:58:26 Changing activity: Busy -> Idle


The StarterLog has:

2/3 19:58:26 condor_read(): recv() returned -1, errno = 10054, assuming failure.
2/3 19:58:26 ERROR "Assertion ERROR on (result)" at line 270 in file ..\src\condor_starter.V6.1\NTsenders.C


This looks like a network error, e.g. packet loss, right? If so, why doesn't the schedd pick up on this? This particular machine has been in this state now for approximately 14 hours. Is there a setting we can tweak on the submit node to make it aware of the situation more quickly?
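
In case it helps anyone looking at the same symptom, here is a rough Python sketch of how the transition could be spotted in a StartLog. It is only a sketch: the two patterns are taken from the log lines quoted above and may not match other Condor versions, and the function name find_abnormal_starter_exits is made up for illustration. It does not recover anything; it just reports starters that exited with a non-zero status right before the slot went Busy -> Idle.

```python
import re

# Patterns based only on the StartLog lines quoted in this message;
# other Condor versions may word these messages differently.
EXIT_RE = re.compile(r"Starter pid (\d+) exited with status (\d+)")
IDLE_RE = re.compile(r"Changing activity: Busy -> Idle")


def find_abnormal_starter_exits(lines):
    """Return (pid, status) for each non-zero starter exit that is
    followed by a Busy -> Idle activity change (hypothetical helper)."""
    hits = []
    pending = None  # last non-zero starter exit not yet paired with an Idle change
    for line in lines:
        m = EXIT_RE.search(line)
        if m:
            pid, status = int(m.group(1)), int(m.group(2))
            pending = (pid, status) if status != 0 else None
            continue
        if pending and IDLE_RE.search(line):
            hits.append(pending)
            pending = None
    return hits


# The StartLog lines from above, used as sample input.
sample = [
    "2/3 19:58:26 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())",
    "2/3 19:58:26 Starter pid 3008 exited with status 4",
    "2/3 19:58:26 State change: starter exited",
    "2/3 19:58:26 Changing activity: Busy -> Idle",
]

for pid, status in find_abnormal_starter_exits(sample):
    print(f"starter {pid} exited with status {status} before going idle")
```

To run it against a real log, the lines of the StartLog file could be passed in place of sample.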

Cheers,
Mark