
Re: [Condor-users] condor_status and condor_q disagree about state of vm's



Thanks for the reply. Unfortunately, condor_status -direct doesn't show anything different. And yes, I'm currently running a startd on the central manager, which is the server that has the issue. Perhaps that is the cause in the first place. I hope not, as it's going to be difficult to get additional hardware at this point in my project.

Regards,
Bob

Kewley, J (John) wrote:
Have you tried using the -direct option to condor_status to get the
info from the node itself rather than from the central node?

BTW do you have a startd on your central node too? If so,
you should be careful, as there may be security implications.

Cheers

JK

-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx]On Behalf Of Bob Kinney
Sent: Friday, April 20, 2007 6:50 PM
To: condor-users@xxxxxxxxxxx
Subject: [Condor-users] condor_status and condor_q disagree about state of vm's


Hi:

I've spent the last couple of days looking for an answer to this issue and searched the archives, but came up empty-handed. If this has been addressed before, please excuse the rehash.

I've got a small pool of two SMP machines, each with two dual-core Opteron processors. In the default configuration that's 8 vm's (4 per machine). I would expect this to mean that I can never have more than 8 jobs running in this pool at any given time, but I have been able to do just that.

For (as yet) undetermined reasons, the schedd will not recognize that a startd is running on some vms. The (trimmed) output of condor_status is below:

Name          OpSys       Arch   State      Activity

vm1@server-1  LINUX       X86_64 Unclaimed  Idle
vm2@server-1  LINUX       X86_64 Unclaimed  Idle
vm3@server-1  LINUX       X86_64 Claimed    Busy
vm4@server-1  LINUX       X86_64 Unclaimed  Idle
vm1@server-2  LINUX       X86_64 Unclaimed  Idle
vm2@server-2  LINUX       X86_64 Unclaimed  Idle
vm3@server-2  LINUX       X86_64 Claimed    Busy
vm4@server-2  LINUX       X86_64 Claimed    Busy

Now look at the (trimmed) results of a condor_q -running:

ID      HOST(S)
68.0   vm4@server-1
69.0   vm4@server-2
70.0   vm3@server-1
71.0   vm3@server-2

Notice that vm4 on server-1 is running a job, but shows up as Unclaimed/Idle. Does anyone have an explanation of why this might happen, or what I can do to further debug the issue?
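
To make the comparison repeatable, here is a minimal Python sketch that cross-checks the two trimmed listings and reports any job whose host is not shown as Claimed. It assumes the column layout shown in this message; the function names (parse_status, parse_queue, find_mismatches) are made up for illustration and are not part of Condor.

```python
# Hypothetical helper: compare trimmed condor_status output with
# trimmed condor_q -running output. Column positions are assumed to
# match the listings in this message, not any guaranteed Condor format.

STATUS_TEXT = """\
Name          OpSys       Arch   State      Activity

vm1@server-1  LINUX       X86_64 Unclaimed  Idle
vm2@server-1  LINUX       X86_64 Unclaimed  Idle
vm3@server-1  LINUX       X86_64 Claimed    Busy
vm4@server-1  LINUX       X86_64 Unclaimed  Idle
vm1@server-2  LINUX       X86_64 Unclaimed  Idle
vm2@server-2  LINUX       X86_64 Unclaimed  Idle
vm3@server-2  LINUX       X86_64 Claimed    Busy
vm4@server-2  LINUX       X86_64 Claimed    Busy
"""

QUEUE_TEXT = """\
ID      HOST(S)
68.0   vm4@server-1
69.0   vm4@server-2
70.0   vm3@server-1
71.0   vm3@server-2
"""


def parse_status(text):
    """Return {vm_name: (state, activity)} for each vm line."""
    slots = {}
    for line in text.splitlines():
        fields = line.split()
        # vm lines have five columns and a name like vm1@server-1
        if len(fields) >= 5 and "@" in fields[0]:
            slots[fields[0]] = (fields[3], fields[4])
    return slots


def parse_queue(text):
    """Return {job_id: vm_name} for each running job line."""
    jobs = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and "@" in fields[1]:
            jobs[fields[0]] = fields[1]
    return jobs


def find_mismatches(slots, jobs):
    """List (job_id, vm_name, state) where the vm is not Claimed."""
    mismatches = []
    for job_id, host in sorted(jobs.items()):
        state, _activity = slots.get(host, ("Missing", "Missing"))
        if state != "Claimed":
            mismatches.append((job_id, host, state))
    return mismatches


if __name__ == "__main__":
    slots = parse_status(STATUS_TEXT)
    jobs = parse_queue(QUEUE_TEXT)
    for job_id, host, state in find_mismatches(slots, jobs):
        print(f"job {job_id} runs on {host}, but condor_status says {state}")
```

With the listings from this message, the only mismatch it reports is job 68.0 on vm4@server-1.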

Some other information that might be relevant:

* server-1 is the central manager for this pool and runs a schedd
* jobs are remotely submitted from other hosts to the schedd on server-1
* server-2 does not seem to have the same issue (i.e. condor_status always reports the correct results).
* if other jobs are submitted to run on server-1, the vm's that report Claimed/Busy will change (i.e. vm3 will be Idle, vm4 will be Busy).

Thanks in advance for any assistance anyone can offer.

Regards,
Bob

--
Earl (Bob) Kinney
UNIX Systems Administrator
Harvard-MIT Data Center
_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at either
https://lists.cs.wisc.edu/archive/condor-users/
http://www.opencondor.org/spaces/viewmailarchive.action?key=CONDOR

