
Re: [condor-users] Troubleshooting process



Quoting Nick LeRoy <nleroy@xxxxxxxxxxx>:

> Let's back up and go through a couple of basics, ok?  In your earlier message,
> you talked about a non-execute machine, but here you're talking about the
> startd; this leads me to believe that there may be some confusion which I can
> hopefully help clear up.

Sorry about that confusion.  My original message was only about
troubleshooting the server, to make sure it was fine before trying to get a
client up and running.  However, the first response I received dealt with
troubleshooting clients, so I figured I'd just go with it =).

> 1. An execute machine is, basically, a machine that runs a startd.  An execute
> machine can also be a submit machine and / or a "central manager".

Anyway, I didn't want the server to be an execute machine -- only a submit
machine and a central manager.  However, the documentation describing how to get
multiple interfaces to work on the server says to enable startd on the server.
So... I did.  If startd is only needed for execute nodes, however, then I don't
need it to load on the server.  I will take that out.
 
> 2. Submit machines run the schedd.
> 3. Central managers run the collector and negotiator.  There is just one CM
> per pool.

Ok, so my server should be running master, schedd, collector, and negotiator.
My clients should be running master and startd.  I will make the necessary changes.

> 4. A single machine can fill more than one of these roles.

We run a pretty slim compute cluster, so there shouldn't be a need for much overlap.
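
To keep the roles above straight, here is a rough sketch of the role-to-daemon mapping in Python. The mapping itself comes from the points quoted above; the helper name daemon_list() is made up for illustration, and I am assuming DAEMON_LIST is the condor_config setting that controls which daemons the master starts, so check that against your own config before relying on it.

```python
# Role-to-daemon mapping as described in this thread.
# daemon_list() is a hypothetical helper, not part of Condor.
ROLE_DAEMONS = {
    "central_manager": ["COLLECTOR", "NEGOTIATOR"],
    "submit": ["SCHEDD"],
    "execute": ["STARTD"],
}


def daemon_list(roles):
    """Return the daemons a machine needs for the given roles, MASTER first."""
    daemons = ["MASTER"]  # both the server and the clients run the master
    for role in roles:
        for daemon in ROLE_DAEMONS[role]:
            if daemon not in daemons:  # a machine can fill several roles
                daemons.append(daemon)
    return daemons


server = daemon_list(["central_manager", "submit"])
client = daemon_list(["execute"])

# Assumption: DAEMON_LIST is the setting that takes this list.
print("server: DAEMON_LIST = " + ", ".join(server))
print("client: DAEMON_LIST = " + ", ".join(client))
```

For the setup discussed here, that gives master, collector, negotiator and schedd on the server, and master plus startd on the clients.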

> Above, you're talking about a startd, and state that the startd is exiting
> with status 4 (I assume that you're getting this information from the
> MasterLog).  You can start to diagnose what's going wrong with the startd by
> looking in the StartLog on that machine.  If there's not enough information,
> you may need to turn on more debugging by editing the machine's condor_config
> and changing the STARTD_DEBUG setting; try something like this:
> STARTD_DEBUG        = D_COMMAND D_FULLDEBUG D_JOB

StartLog on the *client* is returning:
 ERROR "fopen of "/var/run/utmp"" at line 358 in file idle_time.C

I will also try running with more debugging.
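
Since the error is an fopen failure on /var/run/utmp, one quick thing to rule out is whether that file exists and is readable on the client. The sketch below is only a guess at the cause (a missing or unreadable file); check_readable() is a made-up helper, and it should be run as whatever user the startd runs as for the result to mean anything.

```python
import os


def check_readable(path):
    """Report whether the current user can open `path` for reading."""
    if not os.path.exists(path):
        return "missing"
    if not os.access(path, os.R_OK):
        return "unreadable"
    return "ok"


if __name__ == "__main__":
    # Path taken from the StartLog error message above.
    print("/var/run/utmp:", check_readable("/var/run/utmp"))
```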

> Now, let's go back to trying to get your CM working.  You'll most certainly
> want to set the NETWORK_INTERFACE parameter on this machine; set it to the IP
> address of the NIC that you want it to talk over.  For example, if eth0 is
> internal with an address of 192.168.1.10 and eth1 is external with an address of
> 128.105.2.6, and you wanted it to talk over eth0 (internal), you would have a
> line like this:
> NETWORK_INTERFACE = 192.168.1.10

condor_config.local on the CM has NETWORK_INTERFACE = 172.16.0.1

> If you look in the MasterLog or CollectorLog of this machine, you should see
> (near the top) a line like:
> 3/2 21:59:06 DaemonCore: Command Socket at <192.168.1.10:1234>
> Verify that it's listening on the right address.

On the CM machine...
MasterLog:
 DaemonCore: Command Socket at <172.16.0.1:33600>

CollectorLog:
 DaemonCore: Command Socket at <172.16.0.1:9618>

Also in the CollectorLog:
 WARNING:  No master ad for < server.beowulf >

That sounds bad =)
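
For checking the Command Socket lines against NETWORK_INTERFACE without eyeballing each log, something like the following Python sketch could work. The regular expression is based only on the log lines shown in this thread, so other Condor versions may format the line differently; command_sockets() and wrong_interface() are hypothetical helper names.

```python
import re

# Matches lines such as:
#   DaemonCore: Command Socket at <172.16.0.1:9618>
SOCKET_RE = re.compile(r"DaemonCore: Command Socket at <([\d.]+):(\d+)>")


def command_sockets(log_text):
    """Return every (ip, port) pair announced in the log text."""
    return [(ip, int(port)) for ip, port in SOCKET_RE.findall(log_text)]


def wrong_interface(log_text, expected_ip):
    """Return the sockets that are NOT bound to the expected address."""
    return [(ip, port) for ip, port in command_sockets(log_text)
            if ip != expected_ip]


# The two lines reported above from the MasterLog and CollectorLog.
sample = (
    "DaemonCore: Command Socket at <172.16.0.1:33600>\n"
    "DaemonCore: Command Socket at <172.16.0.1:9618>\n"
)

# An empty list means every daemon is listening on NETWORK_INTERFACE.
print(wrong_interface(sample, "172.16.0.1"))
```

With the values from my CM, both daemons are on 172.16.0.1, which matches NETWORK_INTERFACE in condor_config.local.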

> If the Collector is running properly, condor_status will show a list of
> startd's that are reporting to it.  If there are none, it won't report an
> error, it'll just show you that there are none.  If you get an error, it
> means that either it's not configured to talk to the Collector at the right
> address, or that the Collector itself isn't listening at that address.

condor_status on the CM returns no output.
 
> If that happens, first verify that the collector is actually running (on the
> CM).  You can use 'ps' for this.  Also, the Master should log that it started
> the collector, and the collector itself will log to the CollectorLog.

The collector is running on the CM.  Also, the logs show as you say.

> Hopefully, this should be enough to get you started.

Thank you, it's definitely pointed me in a good direction.  I am guessing from
this that the CM is correctly set up.  I should probably switch to troubleshooting
the clients.  I am getting startd exiting with status 4 on the clients.
