Re: [Condor-users] Bit of a problem with HAD
- Date: Tue, 10 Jan 2006 15:05:42 -0600
- From: Nick LeRoy <nleroy@xxxxxxxxxxx>
- Subject: Re: [Condor-users] Bit of a problem with HAD
On Tue January 10 2006 2:37 pm, Finch, Ralph wrote:
> condor -version
> $CondorVersion: 6.7.13 Nov 7 2005 $
> $CondorPlatform: INTEL-WINNT50 $
>
> My desktop machine and another machine are the HAD machines, and also
> serve as condor executors.
By "are the HAD machines", I assume that you mean "are the two machines that
the negotiator can run on" (and, thus, are set up with condor_had). Is that
correct?
> When I installed this a few weeks ago things were working OK, though I
> don't think I tested dagman then. Now I have these symptoms: when I
> submit a dagman job, the jobs wait in the queue several minutes. Then
> on my machine (MERRIT) a condor_exec.exe starts and runs full CPU speed,
> but no other jobs start to run.
There's really not much information here to go on, but let's see what we can
do...
Is condor_had running on both machines? Is condor_negotiator running on
(exactly) one of the machines? Which one? Is one of the machines set up as
the primary (HAD_USE_PRIMARY)? Which one?
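Put another way, the state I'd expect is condor_had on both machines and
condor_negotiator on exactly one of them. Here is a minimal Python sketch of
that check. It assumes you have already collected the running process names
on each host by hand; the function name check_had_state and the host name
OTHER are made up for illustration and are not part of Condor.

```python
def check_had_state(running):
    """Check hand-collected process lists against the expected HAD layout.

    `running` maps a host name to the set of daemon process names seen on
    that host. Returns a list of problems; an empty list means condor_had
    is on every host and condor_negotiator is on exactly one.
    """
    problems = []
    for host, procs in sorted(running.items()):
        if "condor_had" not in procs:
            problems.append(f"{host}: condor_had is not running")

    negotiators = sorted(
        host for host, procs in running.items() if "condor_negotiator" in procs
    )
    if not negotiators:
        problems.append("no condor_negotiator running on any host")
    elif len(negotiators) > 1:
        problems.append(
            "condor_negotiator running on more than one host: "
            + ", ".join(negotiators)
        )
    return problems


if __name__ == "__main__":
    # Example data only: MERRIT is from the report, OTHER is a placeholder.
    state = {
        "MERRIT": {"condor_master", "condor_had", "condor_schedd", "condor_startd"},
        "OTHER": {"condor_master", "condor_had", "condor_startd"},
    }
    for problem in check_had_state(state):
        print(problem)
```

On Windows the process names will likely carry an .exe suffix, so strip it
before building the sets.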
I'm not 100% certain, but from a quick perusal of the code, it appears that
the message "Haven't heard from negotiator, trying to claim local startd"
means exactly what it says: the Schedd hasn't heard from the negotiator for a
while (i.e. no negotiation cycles), so as a fallback it's trying to claim the
startds on the local machine (or something like that). I don't know why the
"PERMISSION DENIED" is given, though (guess: it wasn't properly claimed
through the negotiator, so you don't have a claim ticket).
So, the bottom line is that we need to figure out why we haven't had a
negotiation cycle... Start by answering the questions that I asked at the
start of my reply; I think that they'll lead us to an answer.
If you find that both HADs are running, but no negotiator is running, we'd
need to see what's going on in the HadLog (both machines), the MasterLog
(both), and NegotiatorLog (again, both machines).
On your own, you can look in the HadLogs to see which machine thinks it's the
leader, then look in the MasterLog to verify that it tried to start the
Negotiator properly, and the NegotiatorLog to verify that it actually started
properly.
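If it helps with wading through those logs, here is a minimal Python sketch
that pulls the last few lines mentioning a keyword out of a log file. It
makes no assumptions about the log line format and just does a
case-insensitive substring match. The function name last_matches and the
keywords you pass in are my own choices; what the daemons actually write
may be worded differently, so adjust the keywords to what you see.

```python
from collections import deque


def last_matches(path, needles, limit=20):
    """Return the last `limit` lines of `path` containing any of `needles`.

    Matching is a case-insensitive substring test; the file is read with
    undecodable bytes replaced so odd characters do not abort the scan.
    """
    lowered = [needle.lower() for needle in needles]
    keep = deque(maxlen=limit)  # only the most recent matches are retained
    with open(path, "r", errors="replace") as log:
        for line in log:
            text = line.rstrip("\n")
            if any(needle in text.lower() for needle in lowered):
                keep.append(text)
    return list(keep)


if __name__ == "__main__":
    import sys

    if len(sys.argv) >= 3:
        # First argument is the log file, the rest are keywords.
        for match in last_matches(sys.argv[1], sys.argv[2:]):
            print(match)
    else:
        print("usage: python scanlog.py <logfile> <keyword> [<keyword> ...]")
```

Run it once per log on each machine, for example against the HadLog with the
keyword "leader", then against the MasterLog and NegotiatorLog with
"negotiator", and compare the timestamps across the two machines.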
Finally, I'd like to note that the 6.7.14 master and HAD can better handle
cases in which the HAD tells the master "start the negotiator", but the
master is unable to do so for whatever reason. If you are upgrading to
6.7.14, however, make sure that you upgrade both the master and the HAD
together; *bad* things will happen if you don't...
-Nick
--
<<< The answer is out there, Neo. >>>
/`-_ Nicholas R. LeRoy The Condor Project
{ }/ http://www.cs.wisc.edu/~nleroy http://www.cs.wisc.edu/condor
\ / nleroy@xxxxxxxxxxx The University of Wisconsin
|_*_| 608-265-5761 Department of Computer Sciences