
Re: [Condor-users] Bit of a problem with HAD



> From: condor-users-bounces@xxxxxxxxxxx 
> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Nick LeRoy
> Sent: Tuesday, January 10, 2006 1:06 PM
> To: Condor-Users Mail List
> Subject: Re: [Condor-users] Bit of a problem with HAD
> 
> On Tue January 10 2006 2:37 pm, Finch, Ralph wrote:
> > condor -version
> > $CondorVersion: 6.7.13 Nov  7 2005 $
> > $CondorPlatform: INTEL-WINNT50 $
> >
> > My desktop machine and another machine are the HAD machines, and also
> > serve as condor executors.
> 
> By "are the HAD machines", I assume that you mean "are the two machines
> that the negotiator can run on" (and, thus, are setup with condor_had).
> Is that correct?

Yes.  Sorry for my poor terminology.

> > When I installed this a few weeks ago things were working OK, though I
> > don't think I tested dagman then.  Now I have these symptoms:  when I
> > submit a dagman job, the jobs wait in the queue several minutes.  Then
> > on my machine (MERRIT) a condor_exec.exe starts and runs full CPU speed,
> > but no other jobs start to run.

BTW, I'm fairly sure that condor_exec.exe is the desired job,
but I don't recall seeing that .exe name before; I was expecting
hydro.exe.

It's also puzzling that it would run on my machine, since
mine is one of the slowest in the pool.

 
> Is condor_had running on both machines?

Yes, I just checked to be sure.

> Is condor_negotiator running on (exactly) one of the machines?  Which
> one?  Is one of the machines setup as the primary (HAD_USE_PRIMARY)?
> Which one?

Yes; delta-mod.  Yes; delta-mod. I checked these again to be sure.

> On your own, you can look in the HadLogs to see which machine thinks
> it's the leader, then look in the MasterLog to verify that it tried to
> start the Negotiator properly, and the NegotiatorLog to verify that it
> actually started properly.

There's something odd in the negotiator log...

delta-mod HADLog:

1/10 14:21:07 ******************************************************
1/10 14:21:07 Using config file: Z:\Condor\condor_config
1/10 14:21:07 Using local config files: Z:/Condor/condor_config.local
1/10 14:21:07 DaemonCore: Command Socket at <136.200.32.102:9450>
1/10 14:21:07 Starting HAD ....
1/10 14:21:07 ** Register on stateMachineTimerID , interval = 21
1/10 14:21:07 ** HAD_ID   1
1/10 14:21:07 ** HAD_CYCLE_INTERVAL   42
1/10 14:21:07 ** HAD_CONNECTION_TIMEOUT   5
1/10 14:21:07 ** HAD_USE_PRIMARY(true/false)   1
1/10 14:21:07 ** AM I PRIMARY ?(true/false)   1
1/10 14:21:07 ** HAD_LIST(others only)
1/10 14:21:07 **    <136.200.32.182:9450>
1/10 14:21:07 ** HAD_STAND_ALONE_DEBUG(true/false)    0
1/10 14:21:53 DaemonCore: Command received via TCP from host
<136.200.32.182:4623>
1/10 14:21:53 DaemonCore: received command 701 (SEND ID command),
calling handler (commandHandler)
1/10 14:22:14 DaemonCore: Command received via TCP from host
<136.200.32.182:4636>
1/10 14:22:14 DaemonCore: received command 701 (SEND ID command),
calling handler (commandHandler)

merrit HADLog:

1/10 14:44:43 DaemonCore: Command received via TCP from host
<136.200.32.102:4029>
1/10 14:44:43 DaemonCore: received command 700 (ALIVE command), calling
handler (commandHandler)
1/10 14:45:04 DaemonCore: Command received via TCP from host
<136.200.32.102:4044>
1/10 14:45:04 DaemonCore: received command 700 (ALIVE command), calling
handler (commandHandler)
1/10 14:45:25 DaemonCore: Command received via TCP from host
<136.200.32.102:4059>
1/10 14:45:25 DaemonCore: received command 700 (ALIVE command), calling
handler (commandHandler)
1/10 14:45:46 DaemonCore: Command received via TCP from host
<136.200.32.102:4076>
1/10 14:45:46 DaemonCore: received command 700 (ALIVE command), calling
handler (commandHandler)
1/10 14:46:07 DaemonCore: Command received via TCP from host
<136.200.32.102:4091>
1/10 14:46:07 DaemonCore: received command 700 (ALIVE command), calling
handler (commandHandler)
1/10 14:46:28 DaemonCore: Command received via TCP from host
<136.200.32.102:4106>
1/10 14:46:28 DaemonCore: received command 700 (ALIVE command), calling
handler (commandHandler)
1/10 14:46:49 DaemonCore: Command received via TCP from host
<136.200.32.102:4123>
1/10 14:46:49 DaemonCore: received command 700 (ALIVE command), calling
handler (commandHandler)
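
For what it's worth, here is a rough Python sketch for checking how
regularly the ALIVE messages arrive in a HADLog excerpt like the one
above. The function name alive_intervals is made up for illustration (it
is not a Condor tool), and it assumes all entries fall on the same day
and that the "(ALIVE command)" text sits on the timestamped line, as it
does in this paste.

```python
import re

# Matches a timestamped entry such as
# "1/10 14:44:43 DaemonCore: received command 700 (ALIVE command), calling"
ALIVE = re.compile(r"^\d{1,2}/\d{1,2} (\d{2}):(\d{2}):(\d{2}) .*\(ALIVE command\)")

def alive_intervals(lines):
    """Return the gaps, in seconds, between consecutive ALIVE entries.

    Assumption: all entries are from the same day, so only the
    time-of-day part of the timestamp is compared.
    """
    stamps = []
    for line in lines:
        m = ALIVE.match(line)
        if m:
            hours, minutes, seconds = (int(x) for x in m.groups())
            stamps.append(hours * 3600 + minutes * 60 + seconds)
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]
```

Run over the merrit excerpt, every gap comes out as 21 seconds, the same
value as the "interval = 21" line in the delta-mod HADLog.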

delta-mod MasterLog:

1/10 14:21:07 WinFirewall: get_CurrentProfile failed: 0x800706d9
1/10 14:21:07 Started DaemonCore process
"Z:/Condor/bin/condor_collector.exe", pid and pgroup = 2788
1/10 14:21:07 Started DaemonCore process
"Z:/Condor/bin/condor_startd.exe", pid and pgroup = 4072
1/10 14:21:07 Started DaemonCore process
"Z:/Condor/bin/condor_schedd.exe", pid and pgroup = 2916
1/10 14:21:07 Started DaemonCore process
"Z:/Condor/bin/condor_negotiator.exe", pid and pgroup = 3628
1/10 14:21:07 Started DaemonCore process "Z:/Condor/bin/condor_had.exe",
pid and pgroup = 2288
1/10 14:21:07 DaemonCore: Command received via TCP from host
<136.200.32.102:2907>
1/10 14:21:07 DaemonCore: received command 468 (DAEMON_OFF_FAST),
calling handler (admin_command_handler)
1/10 14:21:07 Handling daemon-specific command for "negotiator"
1/10 14:21:08 Sent signal 3 to NEGOTIATOR (pid 3628)
1/10 14:21:11 DaemonCore: Command received via UDP from host
<136.200.32.102:2936>
1/10 14:21:11 DaemonCore: received command 60011 (DC_NOP), calling
handler (handle_nop())
1/10 14:21:11 The NEGOTIATOR (pid 3628) exited with status 0
1/10 14:22:31 DaemonCore: Command received via TCP from host
<136.200.32.102:3009>
1/10 14:22:31 DaemonCore: received command 469 (DAEMON_ON), calling
handler (admin_command_handler)
1/10 14:22:31 Handling daemon-specific command for "negotiator"
1/10 14:22:31 Started DaemonCore process
"Z:/Condor/bin/condor_negotiator.exe", pid and pgroup = 3560

merrit MasterLog:

1/10 14:21:40 WinFirewall: get_CurrentProfile failed: 0x800706d9
1/10 14:21:40 Started DaemonCore process
"Z:/Condor/bin/condor_collector.exe", pid and pgroup = 524
1/10 14:21:40 Started DaemonCore process
"Z:/Condor/bin/condor_startd.exe", pid and pgroup = 3352
1/10 14:21:40 Started DaemonCore process
"Z:/Condor/bin/condor_schedd.exe", pid and pgroup = 3876
1/10 14:21:40 Started DaemonCore process
"Z:/Condor/bin/condor_negotiator.exe", pid and pgroup = 2360
1/10 14:21:40 Started DaemonCore process "Z:/Condor/bin/condor_had.exe",
pid and pgroup = 2456
1/10 14:21:40 DaemonCore: Command received via TCP from host
<136.200.32.182:4559>
1/10 14:21:40 DaemonCore: received command 468 (DAEMON_OFF_FAST),
calling handler (admin_command_handler)
1/10 14:21:40 Handling daemon-specific command for "negotiator"
1/10 14:21:40 Sent signal 3 to NEGOTIATOR (pid 2360)
1/10 14:21:40 DaemonCore: Command received via UDP from host
<136.200.32.182:4569>
1/10 14:21:40 DaemonCore: received command 60011 (DC_NOP), calling
handler (handle_nop())
1/10 14:21:40 The NEGOTIATOR (pid 2360) exited with status 0
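
A rough Python sketch of how one could read off from a MasterLog excerpt
whether the master left a negotiator running. The helper name
negotiator_pid is hypothetical, and the patterns are taken only from the
"Started DaemonCore process" and "The NEGOTIATOR (pid N) exited" lines
pasted above, so other Condor versions may word these differently. The
lines are joined first because the mail client wrapped some entries.

```python
import re

# Either a negotiator start (group 1 = pid) or a negotiator exit (group 2 = pid).
EVENT = re.compile(
    r'Started DaemonCore process\s+"[^"]*condor_negotiator\.exe",'
    r"\s+pid and pgroup = (\d+)"
    r"|The NEGOTIATOR \(pid (\d+)\) exited"
)

def negotiator_pid(lines):
    """Return the pid of the negotiator still running per the log, or None."""
    # Undo mail-client line wrapping by treating the excerpt as one string.
    text = " ".join(line.strip() for line in lines)
    running = None
    for match in EVENT.finditer(text):
        started, exited = match.groups()
        if started:
            running = int(started)
        elif running == int(exited):
            running = None
    return running
```

On the two excerpts above this gives 3560 for delta-mod and None for
merrit, which agrees with the negotiator running only on delta-mod.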

delta-mod NegotiatorLog (hmmmm, something awry):

1/10 14:32:32 Phase 1:  Obtaining ads from collector ...
1/10 14:32:32   Getting all public ads ...
1/10 14:32:32   Sorting 58 ads ...
1/10 14:32:32   Getting startd private ads ...
1/10 14:32:32 Got ads: 58 public and 28 private
1/10 14:32:32 Public ads include 1 submitter, 28 startd
1/10 14:32:32 Phase 2:  Performing accounting ...
1/10 14:32:32 Phase 3:  Sorting submitter ads by priority ...
1/10 14:32:32 Phase 4.1:  Negotiating with schedds ...
1/10 14:32:32   Negotiating with rfinch@xxxxxxxxxxxx at
<136.200.32.182:4553>
1/10 14:32:32 0 seconds so far
1/10 14:32:32 condor_read(): recv() returned -1, errno = 10054, assuming
failure.
1/10 14:32:32 IO: Failed to read packet header
1/10 14:32:32     Failed to get reply from schedd
1/10 14:32:32   Error: Ignoring schedd for this cycle
1/10 14:32:32 ---------- Finished Negotiation Cycle ----------
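
Regarding the errno in that log: 10054 is the Winsock code
WSAECONNRESET, i.e. the connection was reset by the other side, which
here is the schedd at 136.200.32.182 (merrit, going by the command
socket address in its NegotiatorLog). Below is a small Python sketch for
pulling such codes out of a log excerpt. The function name
explain_errnos is made up for illustration, and the table lists only a
handful of standard Winsock codes.

```python
import re

# A few standard Winsock error codes; deliberately not a complete list.
WINSOCK_ERRORS = {
    10053: "WSAECONNABORTED (connection aborted by local software)",
    10054: "WSAECONNRESET (connection reset by peer)",
    10060: "WSAETIMEDOUT (connection timed out)",
    10061: "WSAECONNREFUSED (connection refused)",
}

ERRNO = re.compile(r"errno = (\d+)")

def explain_errnos(lines):
    """Return (code, description) pairs for every 'errno = N' in the lines."""
    found = []
    for line in lines:
        for match in ERRNO.finditer(line):
            code = int(match.group(1))
            found.append((code, WINSOCK_ERRORS.get(code, "unknown")))
    return found
```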

merrit NegotiatorLog:

1/10 14:21:40 ******************************************************
1/10 14:21:40 Using config file: z:\Condor\condor_config
1/10 14:21:40 Using local config files: Z:/Condor/condor_config.local
1/10 14:21:40 DaemonCore: Command Socket at <136.200.32.182:4554>
1/10 14:21:40 ACCOUNTANT_HOST = None (local)
1/10 14:21:40 NEGOTIATOR_INTERVAL = 300 sec
1/10 14:21:40 NEGOTIATOR_TIMEOUT = 30 sec
1/10 14:21:40 MAX_TIME_PER_SUBMITTER = 31536000 sec
1/10 14:21:40 MAX_TIME_PER_PIESPIN = 31536000 sec
1/10 14:21:40 PREEMPTION_REQUIREMENTS = FALSE
1/10 14:21:40 PREEMPTION_RANK = None
1/10 14:21:40 NEGOTIATOR_PRE_JOB_RANK = None
1/10 14:21:40 NEGOTIATOR_POST_JOB_RANK = None
1/10 14:21:40 ---------- Started Negotiation Cycle ----------
1/10 14:21:40 Phase 1:  Obtaining ads from collector ...
1/10 14:21:40   Getting all public ads ...
1/10 14:21:40   Sorting 0 ads ...
1/10 14:21:40   Getting startd private ads ...
1/10 14:21:40 Got ads: 0 public and 0 private
1/10 14:21:40 Public ads include 0 submitter, 0 startd
1/10 14:21:40 Phase 2:  Performing accounting ...
1/10 14:21:40 Phase 3:  Sorting submitter ads by priority ...
1/10 14:21:40 Phase 4.1:  Negotiating with schedds ...
1/10 14:21:40 ---------- Finished Negotiation Cycle ----------
1/10 14:21:40 Got SIGQUIT.  Performing fast shutdown.
1/10 14:21:40 **** condor_negotiator.exe (condor_NEGOTIATOR) EXITING
WITH STATUS 0


> Finally, I'd like to note that the 6.7.14 master and HAD can better
> handle cases in which the HAD tells the master "start the negotiator",
> but the master is unable to do so for whatever reason.  If you are
> upgrading to 6.7.14, however, make sure that you upgrade both the master
> and the HAD together; *bad* things will happen if you don't...

OK; I can do the upgrade if you think it a good idea.  Thanks much.

Ralph Finch, P.E.
Dept. of Water Resources
Bay-Delta Office, Room 215-13
Sacramento, CA  95814
916-653-7552
rfinch@xxxxxxxxxxxx