
Re: [Condor-users] condor_status and condor_q disagree about state of vm's



Daniel:

Bingo! There were in fact two startd's running. You have made my day (and my weekend). Thanks a bunch.

Regards,
Bob

Daniel Forrest wrote:
Bob,

I've spent the last couple of days looking for an answer to this
issue and searched the archives, but came up empty-handed.  If this
has been addressed before please excuse the rehash.

I've got a small pool of two SMP machines, both with dual dual-core
Opteron processors.  In the default configuration that's 8 vm's.  I
would expect that this would mean that I should never be able to
have more than 8 jobs running in this pool at any given time, but
I have been able to do just that.

For (as yet) undetermined reasons, the schedd will not recognize
that a startd is running on some vm's.  See below the (trimmed)
results of a condor_status:

Name          OpSys       Arch   State      Activity

vm1@server-1  LINUX       X86_64 Unclaimed  Idle
vm2@server-1  LINUX       X86_64 Unclaimed  Idle
vm3@server-1  LINUX       X86_64 Claimed    Busy
vm4@server-1  LINUX       X86_64 Unclaimed  Idle
vm1@server-2  LINUX       X86_64 Unclaimed  Idle
vm2@server-2  LINUX       X86_64 Unclaimed  Idle
vm3@server-2  LINUX       X86_64 Claimed    Busy
vm4@server-2  LINUX       X86_64 Claimed    Busy

Now look at the (trimmed) results of a condor_q -running:

ID      HOST(S)
68.0   vm4@server-1
69.0   vm4@server-2
70.0   vm3@server-1
71.0   vm3@server-2

Notice that vm4 on server-1 is running a job, but shows up as
Unclaimed/Idle.  Does anyone have an explanation of why this might
happen or what I can do to further debug the issue?
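
The comparison between the two listings above can be automated. The following is only a rough sketch: it assumes the trimmed column layouts shown in this message (name, opsys, arch, state, activity for condor_status; job ID then host for condor_q -running), and the function names and sample constants are made up for illustration. Real output has more columns and would need different parsing.

```python
# Sample text copied from the trimmed listings in this thread.
STATUS_SAMPLE = """\
Name          OpSys       Arch   State      Activity

vm1@server-1  LINUX       X86_64 Unclaimed  Idle
vm2@server-1  LINUX       X86_64 Unclaimed  Idle
vm3@server-1  LINUX       X86_64 Claimed    Busy
vm4@server-1  LINUX       X86_64 Unclaimed  Idle
vm1@server-2  LINUX       X86_64 Unclaimed  Idle
vm2@server-2  LINUX       X86_64 Unclaimed  Idle
vm3@server-2  LINUX       X86_64 Claimed    Busy
vm4@server-2  LINUX       X86_64 Claimed    Busy
"""

QUEUE_SAMPLE = """\
ID      HOST(S)
68.0   vm4@server-1
69.0   vm4@server-2
70.0   vm3@server-1
71.0   vm3@server-2
"""


def claimed_slots(status_text):
    """Return the vm names that the status listing reports as Claimed."""
    slots = set()
    for line in status_text.splitlines():
        fields = line.split()
        # Assumed row layout: name, opsys, arch, state, activity.
        if len(fields) >= 5 and "@" in fields[0] and fields[3] == "Claimed":
            slots.add(fields[0])
    return slots


def running_slots(queue_text):
    """Return the vm names that the queue listing says are running a job."""
    slots = set()
    for line in queue_text.splitlines():
        fields = line.split()
        # Assumed row layout: job ID, then host.
        if len(fields) >= 2 and "@" in fields[1]:
            slots.add(fields[1])
    return slots


def mismatches(status_text, queue_text):
    """vm's that run a job according to the queue but are not shown as Claimed."""
    return sorted(running_slots(queue_text) - claimed_slots(status_text))


print(mismatches(STATUS_SAMPLE, QUEUE_SAMPLE))
```

On the sample data this singles out vm4@server-1, the same vm called out above.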

I have seen this type of behavior before.  Check to be sure that there
is only one condor_startd process running on server-1.  I have seen
cases where there are two condor_masters, each with a condor_startd,
and what you see in condor_status is the status of the condor_startd
that has most recently sent an update to your condor_collector.
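
One way to carry out that check is to count the condor_startd and condor_master processes on the machine. The sketch below is an assumption-laden illustration, not part of Condor: the helper names are hypothetical, and it relies on `ps -e -o pid= -o comm=` printing one "PID command" pair per line with no header.

```python
import os
import subprocess


def list_processes():
    """Return 'PID command' lines for every process, without a header row."""
    result = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "comm="],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def find_daemons(ps_text, name="condor_startd"):
    """Return the PIDs of all processes whose command name matches `name`."""
    pids = []
    for line in ps_text.splitlines():
        fields = line.split(None, 1)
        if len(fields) != 2 or not fields[0].isdigit():
            continue
        # Some systems print a full path in the command column.
        if os.path.basename(fields[1].strip()) == name:
            pids.append(int(fields[0]))
    return pids


def has_duplicates(ps_text, name="condor_startd"):
    """True when more than one process with the given name is running."""
    return len(find_daemons(ps_text, name)) > 1
```

Running `find_daemons(list_processes())` on server-1, and again with `name="condor_master"`, should return a single PID each in a healthy setup. More than one suggests the duplicate-daemon situation described above.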



--
Earl (Bob) Kinney
UNIX Systems Administrator
Harvard-MIT Data Center