[Condor-users] Bit of a problem with HAD



condor -version
$CondorVersion: 6.7.13 Nov  7 2005 $
$CondorPlatform: INTEL-WINNT50 $

My desktop machine and one other machine are the HAD machines; both also
serve as Condor execute machines.

When I installed this a few weeks ago things were working OK, though I
don't think I tested DAGMan then.  Now I have these symptoms:  when I
submit a DAGMan job, the jobs wait in the queue for several minutes.  Then
on my machine (MERRIT) a condor_exec.exe starts and runs at full CPU speed,
but no other jobs start to run.

I also get this in MERRIT's SchedLog:

1/10 12:24:19 (pid:2144) Sent ad to central manager for rfinch@xxxxxxxxxxxx
1/10 12:24:19 (pid:2144) Sent ad to 2 collectors for rfinch@xxxxxxxxxxxx
1/10 12:24:19 (pid:2144) Haven't heard from negotiator, trying to claim local startd
1/10 12:24:19 (pid:2144) Claiming local startd vm 2 at <136.200.YYYYY:1219>
1/10 12:24:19 (pid:2144) Negotiator gone, trying to use our local startd
1/10 12:24:27 (pid:2144) Starting add_shadow_birthdate(1287.0)
1/10 12:24:27 (pid:2144) Started shadow for job 1287.0 on "<136.200.YYYYY:1219>", (shadow pid = 1492)
1/10 12:24:27 (pid:2144) Sent ad to central manager for rfinch@xxxxxxxxxxxx
1/10 12:24:27 (pid:2144) Sent ad to 2 collectors for rfinch@xxxxxxxxxxxx
1/10 12:24:27 (pid:2144) Haven't heard from negotiator, trying to claim local startd
1/10 12:24:32 (pid:2144) DaemonCore: PERMISSION DENIED to unknown user from host <136.200.XXXXX:1831> for command 416 (NEGOTIATE)

YYYYY is MERRIT, XXXXX is the other HAD machine (delta-mod).
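
For anyone scanning a longer SchedLog for the same symptom, here is a minimal
Python sketch that picks out the PERMISSION DENIED entries and reports which
host sent which command.  It is not part of Condor; the function names and
the regular expressions are assumptions based only on the log excerpt above,
and it assumes each entry starts with a "month/day hh:mm:ss" stamp, with any
other line treated as a wrapped continuation of the previous entry.

```python
import re

# Assumed entry prefix, taken from the excerpt above: "1/10 12:24:19 ".
ENTRY_START = re.compile(r"^\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2} ")

# Assumed shape of the DaemonCore denial message, taken from the excerpt.
DENIED = re.compile(
    r"PERMISSION DENIED to (?P<user>.+?) from host <(?P<host>[^>]+)> "
    r"for command (?P<num>\d+) \((?P<name>\w+)\)"
)


def join_wrapped(lines):
    """Merge continuation lines (e.g. wrapped by a mail client) into entries."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if ENTRY_START.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line.strip()
    return entries


def denied_commands(lines):
    """Return (host, command number, command name) for each denial entry."""
    found = []
    for entry in join_wrapped(lines):
        match = DENIED.search(entry)
        if match:
            found.append((match["host"], int(match["num"]), match["name"]))
    return found


# Sample lines copied from the excerpt above, in their wrapped form.
sample = [
    "1/10 12:24:27 (pid:2144) Haven't heard from negotiator, trying to claim",
    "local startd",
    "1/10 12:24:32 (pid:2144) DaemonCore: PERMISSION DENIED to unknown user",
    "from host <136.200.XXXXX:1831> for command 416 (NEGOTIATE)",
]
print(denied_commands(sample))
```

To run it against a real log, pass the open file instead of the sample list,
for example denied_commands(open(path)) with path pointing at the SchedLog.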

The HADLog and CollectorLog show no problems.

Any clues appreciated.

Ralph Finch, P.E.
Dept. of Water Resources
Bay-Delta Office, Room 215-13
Sacramento, CA  95814
916-653-7552
rfinch@xxxxxxxxxxxx