
Re: [Condor-users] Bit of a problem with HAD



On Tue January 10 2006 2:37 pm, Finch, Ralph wrote:
> condor -version
> $CondorVersion: 6.7.13 Nov  7 2005 $
> $CondorPlatform: INTEL-WINNT50 $
>
> My desktop machine and another machine are the HAD machines, and also
> serve as condor executors.

By "are the HAD machines", I assume that you mean "are the two machines that 
the negotiator can run on" (and, thus, are set up with condor_had).  Is that 
correct?

> When I installed this a few weeks ago things were working OK, though I
> don't think I tested dagman then.  Now I have these symptoms:  when I
> submit a dagman job, the jobs wait in the queue several minutes.  Then
> on my machine (MERRIT) a condor_exec.exe starts and runs full CPU speed,
> but no other jobs start to run.

There's really not much information here to go on, but let's see what we can 
do...

Is condor_had running on both machines?  Is condor_negotiator running on 
(exactly) one of the machines?  Which one?  Is one of the machines set up as 
the primary (HAD_USE_PRIMARY)?  Which one?
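
One rough way to answer the first two questions is to save a process listing 
on each machine and scan it for the daemon names.  The sketch below is only 
an illustration: the helper name is made up, and it assumes the daemon 
executables are called condor_master, condor_had and condor_negotiator 
(with a .exe suffix on Windows).

```python
# Hypothetical helper: report which Condor daemons appear in a process listing.
# Assumption: the daemon executables are named condor_master, condor_had and
# condor_negotiator, optionally followed by ".exe".
import re

DAEMONS = ("condor_master", "condor_had", "condor_negotiator")


def daemons_in_listing(listing_text):
    """Return the set of daemon names found in the text of a process listing."""
    found = set()
    for line in listing_text.splitlines():
        for name in DAEMONS:
            # Match the name as a whole token, with an optional .exe suffix,
            # so that a longer name sharing the same prefix is not counted.
            pattern = r"(?<![\w.])" + re.escape(name) + r"(\.exe)?(?![\w.])"
            if re.search(pattern, line, re.IGNORECASE):
                found.add(name)
    return found
```

Feed it the saved listing from each machine (for example the output of 
tasklist, where that command is available) and compare the two results: 
condor_had should show up in both, condor_negotiator in exactly one.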

I'm not 100% certain, but from a quick perusal of the code, it appears that 
the message "Haven't heard from negotiator, trying to claim local startd" 
means exactly what it says: the Schedd hasn't heard from the negotiator for a 
while (i.e. no negotiation cycles), so as a fallback it's trying to claim the 
startds on the local machine (or something like that).  I don't know why the 
"PERMISSION DENIED" is given, though (guess: it wasn't properly claimed 
through the negotiator, so you don't have a claim ticket).

So, the bottom line is that we need to figure out why we haven't had a 
negotiation cycle...  Start by answering the questions that I asked at the 
start of my reply; I think that it'll lead us to an answer.

If you find that both HADs are running, but no negotiator is running, we'd 
need to see what's going on in the HadLog (both machines), the MasterLog 
(both), and the NegotiatorLog (again, both machines).

On your own, you can look in the HadLogs to see which machine thinks it's the 
leader, then look in the MasterLog to verify that it tried to start the 
Negotiator properly, and the NegotiatorLog to verify that it actually started 
properly.
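
The checks described in this reply can be written down as a small decision 
function.  This is just a sketch of the triage order above, not anything 
Condor itself provides; the function name, arguments and return strings are 
all made up for illustration.

```python
# Hypothetical triage helper encoding the checks suggested in this reply.
# Inputs are booleans: is condor_had / condor_negotiator running on
# machine A and on machine B?


def next_step(had_a, had_b, neg_a, neg_b):
    """Return a short hint on what to look at next."""
    negotiators = int(bool(neg_a)) + int(bool(neg_b))
    if not (had_a and had_b):
        # HAD is expected on both machines that can host the negotiator.
        return "find out why condor_had is not running on both machines"
    if negotiators == 0:
        # Both HADs are up, but nobody started a negotiator.
        return "check HadLog, MasterLog and NegotiatorLog on both machines"
    if negotiators == 2:
        # Exactly one negotiator is expected at any time.
        return "two negotiators are running; check which HAD thinks it is the leader"
    # Exactly one negotiator: the HAD side looks as expected.
    return "HAD looks fine; look elsewhere for the missing negotiation cycles"
```

For the symptoms in this thread, the interesting case is likely 
next_step(True, True, False, False), which points at the three logs on both 
machines.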

Finally, I'd like to note that the 6.7.14 master and HAD can better handle 
cases in which the HAD tells the master "start the negotiator", but the 
master is unable to do so for whatever reason.  If you are upgrading to 
6.7.14, however, make sure that you upgrade both the master and the HAD 
together; *bad* things will happen if you don't...

-Nick

-- 
           <<< The answer is out there, Neo. >>>
 /`-_    Nicholas R. LeRoy               The Condor Project
{     }/ http://www.cs.wisc.edu/~nleroy  http://www.cs.wisc.edu/condor
 \    /  nleroy@xxxxxxxxxxx              The University of Wisconsin
 |_*_|   608-265-5761                    Department of Computer Sciences