
Re: [Condor-users] Bit of a problem with HAD



Hi

From the logs, everything seems OK with the HADs and the Negotiator (Neg).
The Negotiator, however, cannot talk to the schedd.
According to the last line in the schedd log:
12:24:32 (pid:2144) DaemonCore: PERMISSION DENIED to unknown user from host <136.200.XXXXX:1831> for command 416 (NEGOTIATE)

So I think you need to check the permissions between the Negotiator and the schedd.
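
To spot every such refusal rather than only the last one, the schedd log can be scanned for these entries. The following is only a minimal sketch: the pattern is modeled on the single log line quoted above (other Condor versions may word the message differently), and the name find_denied is hypothetical, not part of Condor.

```python
import re

# Pattern modeled on the one schedd log line quoted above (an assumption;
# the exact wording may differ between Condor versions).
DENIED = re.compile(
    r"PERMISSION DENIED to (?P<user>.+?) from host <(?P<host>[^>]+)> "
    r"for command (?P<num>\d+) \((?P<name>\w+)\)"
)


def find_denied(lines):
    """Return (user, host, command name) for each PERMISSION DENIED entry."""
    hits = []
    for line in lines:
        m = DENIED.search(line)
        if m:
            hits.append((m.group("user"), m.group("host"), m.group("name")))
    return hits


sample = [
    # The entry quoted above, joined onto one line.
    "12:24:32 (pid:2144) DaemonCore: PERMISSION DENIED to unknown user "
    "from host <136.200.XXXXX:1831> for command 416 (NEGOTIATE)",
    # Made-up placeholder line that should not match.
    "12:24:33 (pid:2144) placeholder line without a denial",
]
print(find_denied(sample))
```

In practice the lines would come from the schedd log file of the machine that refuses the connection; the host in each hit is the one that needs to be allowed.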

Also, I do not think this has anything to do with upgrading to 6.7.14, since
the changes introduced there only concern the HAD-Neg mechanism, which seems
to be working fine.

Did you set the following?

HOSTALLOW_NEGOTIATOR = merit, delta-mod
HOSTALLOW_NEGOTIATOR_SCHEDD = merit, delta-mod

To make sure, can you please send us the output of:
condor_config_val HOSTALLOW_NEGOTIATOR
condor_config_val HOSTALLOW_NEGOTIATOR_SCHEDD 
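
Once you have that output, a quick way to sanity-check it is to test each machine name against the list. The sketch below is an illustration only: it assumes the value is a comma-separated list of host names that may contain "*" wildcards, and it compares names as plain text. Condor's own matching (for example, resolving names to IP addresses) is not reproduced here, and host_allowed is a hypothetical helper, not a Condor tool.

```python
from fnmatch import fnmatchcase


def host_allowed(config_value, host):
    """Check a host name against a comma-separated allow list.

    Assumption: entries are literal names or '*' wildcard patterns,
    compared case-insensitively as plain text.
    """
    entries = [e.strip().lower() for e in config_value.split(",") if e.strip()]
    return any(fnmatchcase(host.lower(), entry) for entry in entries)


# Value as suggested above; host names as they appear in this thread.
value = "merit, delta-mod"
for name in ("delta-mod", "merit", "merrit"):
    print(name, host_allowed(value, name))
```

A check like this makes spelling mismatches easy to see: the quoted logs refer to one machine as "merrit", so it is worth confirming that the names in the list match the real host names exactly.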

Thank you

Gabi




> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx 
> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Finch, Ralph
> Sent: Wednesday, January 11, 2006 1:05 AM
> To: Condor-Users Mail List
> Subject: Re: [Condor-users] Bit of a problem with HAD
> 
> 
> > From: condor-users-bounces@xxxxxxxxxxx
> > [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Nick LeRoy
> > Sent: Tuesday, January 10, 2006 1:06 PM
> > To: Condor-Users Mail List
> > Subject: Re: [Condor-users] Bit of a problem with HAD
> > 
> > On Tue January 10 2006 2:37 pm, Finch, Ralph wrote:
> > > condor -version
> > > $CondorVersion: 6.7.13 Nov  7 2005 $
> > > $CondorPlatform: INTEL-WINNT50 $
> > >
> > > My desktop machine and another machine are the HAD machines, and also
> > > serve as condor executors.
> > 
> > By "are the HAD machines", I assume that you mean "are the two machines
> > that the negotiator can run on" (and, thus, are set up with condor_had).
> > Is that correct?
> 
> Yes.  Sorry for my poor terminology.
> 
> > > When I installed this a few weeks ago things were working OK, though I
> > > don't think I tested dagman then.  Now I have these symptoms:  when I
> > > submit a dagman job, the jobs wait in the queue several minutes.  Then
> > > on my machine (MERRIT) a condor_exec.exe starts and runs full CPU speed,
> > > but no other jobs start to run.
> 
> BTW, I'm fairly sure that condor_exec.exe is the desired job, 
> but I don't recall seeing that .exe name before; I was 
> expecting hydro.exe.
> 
> It's also puzzling why it would run on my machine, 'cause
> mine is one of the slowest of the pool.
> 
>  
> > Is condor_had running on both machines?
> 
> Yes, I just checked to be sure.
> 
> > Is condor_negotiator running on (exactly) one of the machines?  Which one?
> > Is one of the machines set up as the primary (HAD_USE_PRIMARY)?  Which one?
> 
> Yes; delta-mod.  Yes; delta-mod. I checked these again to be sure.
> 
> > On your own, you can look in the HadLogs to see which machine thinks it's
> > the leader, then look in the MasterLog to verify that it tried to start
> > the Negotiator properly, and the NegotiatorLog to verify that it actually
> > started properly.
> 
> There's something odd in the negotiator log...
> 
> delta-mod HADLog:
> 
> 1/10 14:21:07 ******************************************************
> 1/10 14:21:07 Using config file: Z:\Condor\condor_config
> 1/10 14:21:07 Using local config files: Z:/Condor/condor_config.local
> 1/10 14:21:07 DaemonCore: Command Socket at <136.200.32.102:9450>
> 1/10 14:21:07 Starting HAD ....
> 1/10 14:21:07 ** Register on stateMachineTimerID , interval = 21
> 1/10 14:21:07 ** HAD_ID   1
> 1/10 14:21:07 ** HAD_CYCLE_INTERVAL   42
> 1/10 14:21:07 ** HAD_CONNECTION_TIMEOUT   5
> 1/10 14:21:07 ** HAD_USE_PRIMARY(true/false)   1
> 1/10 14:21:07 ** AM I PRIMARY ?(true/false)   1
> 1/10 14:21:07 ** HAD_LIST(others only)
> 1/10 14:21:07 **    <136.200.32.182:9450>
> 1/10 14:21:07 ** HAD_STAND_ALONE_DEBUG(true/false)    0
> 1/10 14:21:53 DaemonCore: Command received via TCP from host <136.200.32.182:4623>
> 1/10 14:21:53 DaemonCore: received command 701 (SEND ID command), calling handler (commandHandler)
> 1/10 14:22:14 DaemonCore: Command received via TCP from host <136.200.32.182:4636>
> 1/10 14:22:14 DaemonCore: received command 701 (SEND ID command), calling handler (commandHandler)
> 
> merrit HADLog:
> 
> 1/10 14:44:43 DaemonCore: Command received via TCP from host <136.200.32.102:4029>
> 1/10 14:44:43 DaemonCore: received command 700 (ALIVE command), calling handler (commandHandler)
> 1/10 14:45:04 DaemonCore: Command received via TCP from host <136.200.32.102:4044>
> 1/10 14:45:04 DaemonCore: received command 700 (ALIVE command), calling handler (commandHandler)
> 1/10 14:45:25 DaemonCore: Command received via TCP from host <136.200.32.102:4059>
> 1/10 14:45:25 DaemonCore: received command 700 (ALIVE command), calling handler (commandHandler)
> 1/10 14:45:46 DaemonCore: Command received via TCP from host <136.200.32.102:4076>
> 1/10 14:45:46 DaemonCore: received command 700 (ALIVE command), calling handler (commandHandler)
> 1/10 14:46:07 DaemonCore: Command received via TCP from host <136.200.32.102:4091>
> 1/10 14:46:07 DaemonCore: received command 700 (ALIVE command), calling handler (commandHandler)
> 1/10 14:46:28 DaemonCore: Command received via TCP from host <136.200.32.102:4106>
> 1/10 14:46:28 DaemonCore: received command 700 (ALIVE command), calling handler (commandHandler)
> 1/10 14:46:49 DaemonCore: Command received via TCP from host <136.200.32.102:4123>
> 1/10 14:46:49 DaemonCore: received command 700 (ALIVE command), calling handler (commandHandler)
> 
> delta-mod MasterLog:
> 
> 1/10 14:21:07 WinFirewall: get_CurrentProfile failed: 0x800706d9
> 1/10 14:21:07 Started DaemonCore process "Z:/Condor/bin/condor_collector.exe", pid and pgroup = 2788
> 1/10 14:21:07 Started DaemonCore process "Z:/Condor/bin/condor_startd.exe", pid and pgroup = 4072
> 1/10 14:21:07 Started DaemonCore process "Z:/Condor/bin/condor_schedd.exe", pid and pgroup = 2916
> 1/10 14:21:07 Started DaemonCore process "Z:/Condor/bin/condor_negotiator.exe", pid and pgroup = 3628
> 1/10 14:21:07 Started DaemonCore process "Z:/Condor/bin/condor_had.exe", pid and pgroup = 2288
> 1/10 14:21:07 DaemonCore: Command received via TCP from host <136.200.32.102:2907>
> 1/10 14:21:07 DaemonCore: received command 468 (DAEMON_OFF_FAST), calling handler (admin_command_handler)
> 1/10 14:21:07 Handling daemon-specific command for "negotiator"
> 1/10 14:21:08 Sent signal 3 to NEGOTIATOR (pid 3628)
> 1/10 14:21:11 DaemonCore: Command received via UDP from host <136.200.32.102:2936>
> 1/10 14:21:11 DaemonCore: received command 60011 (DC_NOP), calling handler (handle_nop())
> 1/10 14:21:11 The NEGOTIATOR (pid 3628) exited with status 0
> 1/10 14:22:31 DaemonCore: Command received via TCP from host <136.200.32.102:3009>
> 1/10 14:22:31 DaemonCore: received command 469 (DAEMON_ON), calling handler (admin_command_handler)
> 1/10 14:22:31 Handling daemon-specific command for "negotiator"
> 1/10 14:22:31 Started DaemonCore process "Z:/Condor/bin/condor_negotiator.exe", pid and pgroup = 3560
> 
> merrit MasterLog:
> 
> 1/10 14:21:40 WinFirewall: get_CurrentProfile failed: 0x800706d9
> 1/10 14:21:40 Started DaemonCore process "Z:/Condor/bin/condor_collector.exe", pid and pgroup = 524
> 1/10 14:21:40 Started DaemonCore process "Z:/Condor/bin/condor_startd.exe", pid and pgroup = 3352
> 1/10 14:21:40 Started DaemonCore process "Z:/Condor/bin/condor_schedd.exe", pid and pgroup = 3876
> 1/10 14:21:40 Started DaemonCore process "Z:/Condor/bin/condor_negotiator.exe", pid and pgroup = 2360
> 1/10 14:21:40 Started DaemonCore process "Z:/Condor/bin/condor_had.exe", pid and pgroup = 2456
> 1/10 14:21:40 DaemonCore: Command received via TCP from host <136.200.32.182:4559>
> 1/10 14:21:40 DaemonCore: received command 468 (DAEMON_OFF_FAST), calling handler (admin_command_handler)
> 1/10 14:21:40 Handling daemon-specific command for "negotiator"
> 1/10 14:21:40 Sent signal 3 to NEGOTIATOR (pid 2360)
> 1/10 14:21:40 DaemonCore: Command received via UDP from host <136.200.32.182:4569>
> 1/10 14:21:40 DaemonCore: received command 60011 (DC_NOP), calling handler (handle_nop())
> 1/10 14:21:40 The NEGOTIATOR (pid 2360) exited with status 0
> 
> delta-mod NegotiatorLog (hmmmm, something awry):
> 
> 1/10 14:32:32 Phase 1:  Obtaining ads from collector ...
> 1/10 14:32:32   Getting all public ads ...
> 1/10 14:32:32   Sorting 58 ads ...
> 1/10 14:32:32   Getting startd private ads ...
> 1/10 14:32:32 Got ads: 58 public and 28 private
> 1/10 14:32:32 Public ads include 1 submitter, 28 startd
> 1/10 14:32:32 Phase 2:  Performing accounting ...
> 1/10 14:32:32 Phase 3:  Sorting submitter ads by priority ... 
> 1/10 14:32:32 Phase 4.1:  Negotiating with schedds ...
> 1/10 14:32:32   Negotiating with rfinch@xxxxxxxxxxxx at <136.200.32.182:4553>
> 1/10 14:32:32 0 seconds so far
> 1/10 14:32:32 condor_read(): recv() returned -1, errno = 10054, assuming failure.
> 1/10 14:32:32 IO: Failed to read packet header
> 1/10 14:32:32     Failed to get reply from schedd
> 1/10 14:32:32   Error: Ignoring schedd for this cycle
> 1/10 14:32:32 ---------- Finished Negotiation Cycle ----------
> 
> merrit NegotiatorLog:
> 
> 1/10 14:21:40 ******************************************************
> 1/10 14:21:40 Using config file: z:\Condor\condor_config
> 1/10 14:21:40 Using local config files: Z:/Condor/condor_config.local
> 1/10 14:21:40 DaemonCore: Command Socket at <136.200.32.182:4554>
> 1/10 14:21:40 ACCOUNTANT_HOST = None (local)
> 1/10 14:21:40 NEGOTIATOR_INTERVAL = 300 sec
> 1/10 14:21:40 NEGOTIATOR_TIMEOUT = 30 sec
> 1/10 14:21:40 MAX_TIME_PER_SUBMITTER = 31536000 sec
> 1/10 14:21:40 MAX_TIME_PER_PIESPIN = 31536000 sec
> 1/10 14:21:40 PREEMPTION_REQUIREMENTS = FALSE
> 1/10 14:21:40 PREEMPTION_RANK = None
> 1/10 14:21:40 NEGOTIATOR_PRE_JOB_RANK = None
> 1/10 14:21:40 NEGOTIATOR_POST_JOB_RANK = None
> 1/10 14:21:40 ---------- Started Negotiation Cycle ----------
> 1/10 14:21:40 Phase 1:  Obtaining ads from collector ...
> 1/10 14:21:40   Getting all public ads ...
> 1/10 14:21:40   Sorting 0 ads ...
> 1/10 14:21:40   Getting startd private ads ...
> 1/10 14:21:40 Got ads: 0 public and 0 private
> 1/10 14:21:40 Public ads include 0 submitter, 0 startd
> 1/10 14:21:40 Phase 2:  Performing accounting ...
> 1/10 14:21:40 Phase 3:  Sorting submitter ads by priority ...
> 1/10 14:21:40 Phase 4.1:  Negotiating with schedds ...
> 1/10 14:21:40 ---------- Finished Negotiation Cycle ----------
> 1/10 14:21:40 Got SIGQUIT.  Performing fast shutdown.
> 1/10 14:21:40 **** condor_negotiator.exe (condor_NEGOTIATOR) EXITING WITH STATUS 0
> 
> 
> > Finally, I'd like to note that the 6.7.14 master and HAD can better handle
> > cases in which the HAD tells the master "start the negotiator", but the
> > master is unable to do so for whatever reason.  If you are upgrading to
> > 6.7.14, however, make sure that you upgrade both the master and the HAD
> > together; *bad* things will happen if you don't...
> 
> OK; I can do the upgrade if you think it a good idea.  Thanks much.
> 
> Ralph Finch, P.E.
> Dept. of Water Resources
> Bay-Delta Office, Room 215-13
> Sacramento, CA  95814
> 916-653-7552
> rfinch@xxxxxxxxxxxx
> 
> 
> _______________________________________________
> Condor-users mailing list
> Condor-users@xxxxxxxxxxx 
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>