Bob,
I've spent the last couple of days looking for an answer to this
issue and searched the archives, but came up empty-handed. If this
has been addressed before, please excuse the rehash.
I've got a small pool of two SMP machines, both with dual dual-core
Opteron processors. In the default configuration that's 8 VMs. I
would expect this to mean that I should never be able to have more
than 8 jobs running in this pool at any given time, but I have been
able to do just that.
For (as yet) undetermined reasons, the schedd will not recognize
that a startd is running on some VMs. Below are the (trimmed)
results of a condor_status:
Name OpSys Arch State Activity
vm1@server-1 LINUX X86_64 Unclaimed Idle
vm2@server-1 LINUX X86_64 Unclaimed Idle
vm3@server-1 LINUX X86_64 Claimed Busy
vm4@server-1 LINUX X86_64 Unclaimed Idle
vm1@server-2 LINUX X86_64 Unclaimed Idle
vm2@server-2 LINUX X86_64 Unclaimed Idle
vm3@server-2 LINUX X86_64 Claimed Busy
vm4@server-2 LINUX X86_64 Claimed Busy
Now look at the (trimmed) results of a condor_q -running:
ID HOST(S)
68.0 vm4@server-1
69.0 vm4@server-2
70.0 vm3@server-1
71.0 vm3@server-2
Notice that vm4 on server-1 is running a job (68.0), but shows up as
Unclaimed/Idle. Does anyone have an explanation for why this might
happen, or suggestions on how to debug the issue further?

I have seen this type of behavior before. Check to be sure that there
is only one condor_startd process running on server-1. I have seen
cases where there are two condor_masters, each with a condor_startd,
and what you see in condor_status is the status of the condor_startd
that has most recently sent an update to your condor_collector.
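One way to check for duplicate daemons is to count the running processes by exact command name. A minimal sketch (count_procs is a hypothetical helper, and it assumes the standard daemon names condor_master and condor_startd):

```shell
#!/bin/sh
# count_procs: count processes whose command name matches $1 exactly.
# grep -x requires a whole-line match, so "condor_startd" will not
# also count "condor_starter" processes.
count_procs() {
    ps -e -o comm= | grep -c -x "$1"
}

# Run on server-1; each should print 1 if only one instance is running:
# count_procs condor_master
# count_procs condor_startd
```

If condor_startd (or condor_master) shows up more than once, shutting down the extra condor_master should take its child startd with it, and condor_status should then report consistent state again.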