Re: [condor-users] Troubleshooting process
- Date: Wed, 03 Mar 2004 16:16:13 -0500 (EST)
- From: kge2@xxxxxxxx
- Subject: Re: [condor-users] Troubleshooting process
Quoting Nick LeRoy <nleroy@xxxxxxxxxxx>:
> Let's back up and go through a couple of basics, ok? In your earlier
> message you talked about a non-execute machine, but here you're talking
> about the startd; this leads me to believe that there may be some
> confusion which I can hopefully help clear up.
Sorry about that confusion. My original message referred to just
troubleshooting the server to ensure that it was fine before trying to get a
client up and running. However, the first response I received dealt with
tshooting clients, so I figured I'd just go with it =).
> 1. An execute machine is, basically, a machine that runs a startd. The
> machine can also be a submit machine and / or a "central manager".
Anyway, I didn't want the server to be an execute machine -- only a submit and a
central manager. However, in the documentation describing how to get multiple
interfaces to work on the server, it says to enable startd on the server. So...
I did. If startd is only needed for execute nodes, however, then I don't need
it to load on the server. I will take that out.
> 2. Submit machines run the schedd.
> 3. Central managers run the collector and negotiator. There is just one
> per pool.
Ok, so my server should be running master, schedd, collector, and negotiator.
My clients should be running master and startd. I will make the necessary changes.
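
For reference, the role-to-daemon mapping above can be sketched as a small Python helper that renders the matching DAEMON_LIST line for each kind of machine. The helper and the role names are made up for illustration; only DAEMON_LIST and the daemon names come from Condor itself.

```python
# Hypothetical helper: maps the roles discussed in this thread to the
# daemons each role needs, then renders a DAEMON_LIST config line.
ROLE_DAEMONS = {
    "central_manager": ["COLLECTOR", "NEGOTIATOR"],
    "submit": ["SCHEDD"],
    "execute": ["STARTD"],
}

def daemon_list(roles):
    """Return a DAEMON_LIST line for a machine filling the given roles."""
    daemons = ["MASTER"]  # the master runs on every machine in the pool
    for role in roles:
        for daemon in ROLE_DAEMONS[role]:
            if daemon not in daemons:
                daemons.append(daemon)
    return "DAEMON_LIST = " + ", ".join(daemons)

# Server: submit machine plus central manager, no startd.
print(daemon_list(["submit", "central_manager"]))
# Clients: execute only.
print(daemon_list(["execute"]))
```

The two printed lines are what I would expect to put in the server's and the clients' local config files respectively.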
> 4. A single machine can fill more than one of these roles.
We run a pretty slim compute cluster, so there shouldn't be a need for much overlap.
> Above, you're talking about a startd, and state that the startd is
> exiting with status 4 (I assume that you're getting this information
> from the MasterLog). You can start to diagnose what's going wrong with
> the startd by looking in the StartLog on that machine. If there's not
> enough information, you may need to turn on more debugging by editing
> the machine's configuration file and editing the STARTD_DEBUG setting;
> try something like this:
> STARTD_DEBUG = D_COMMAND D_FULLDEBUG D_JOB
StartLog on the *client* is returning:
ERROR "fopen of "/var/run/utmp"" at line 358 in file idle_time.C
I will also try running with more debugging.
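
The error only says that the fopen failed, so a quick first check is whether that file exists and can be opened on the client. A minimal sketch in Python, assuming the problem is a missing or unreadable /var/run/utmp (the log line suggests this but does not prove it); check_readable is a made-up name:

```python
import os

def check_readable(path="/var/run/utmp"):
    """Report whether `path` is missing, unreadable, or can be opened."""
    if not os.path.exists(path):
        return "missing"
    try:
        with open(path, "rb"):
            return "ok"
    except OSError:
        # The file exists but this user cannot open it for reading.
        return "unreadable"

print(check_readable())
```

This should be run as the same user the startd runs as, since a permissions problem would not show up when run as a different user.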
> Now, let's go back to trying to get your CM working. You'll most likely
> want to set the NETWORK_INTERFACE parameter on this machine; set it to
> the IP address of the NIC that you want it to talk over. For example, if
> eth0 is internal with an address of 192.168.1.10 and eth1 is external
> with an address of 184.108.40.206, and you wanted it to talk over eth0
> (internal), you would have a line like this:
> NETWORK_INTERFACE = 192.168.1.10
condor_config.local on the CM has NETWORK_INTERFACE = 172.16.0.1
> If you look in the MasterLog or CollectorLog of this machine, you should
> see (near the top) a line like:
> 3/2 21:59:06 DaemonCore: Command Socket at <192.168.1.10:1234>
> Verify that it's listening on the right address.
On the CM machine...
DaemonCore: Command Socket at <172.16.0.1:33600>
DaemonCore: Command Socket at <172.16.0.1:9618>
Also in the CollectorLog:
WARNING: No master ad for < server.beowulf >
That sounds bad =)
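
The Command Socket check can also be scripted rather than done by eye. A rough sketch in Python, assuming the log lines have the format shown above; the function names are made up and the sample text is just the two lines from my CM:

```python
import re

# Matches lines such as: DaemonCore: Command Socket at <172.16.0.1:9618>
SOCKET_RE = re.compile(r"DaemonCore: Command Socket at <([\d.]+):(\d+)>")

def command_sockets(log_text):
    """Return an (ip, port) pair for each Command Socket line in the text."""
    return [(ip, int(port)) for ip, port in SOCKET_RE.findall(log_text)]

def wrong_interface(log_text, expected_ip):
    """Return the sockets that are NOT bound to the expected address."""
    return [s for s in command_sockets(log_text) if s[0] != expected_ip]

sample = (
    "DaemonCore: Command Socket at <172.16.0.1:33600>\n"
    "DaemonCore: Command Socket at <172.16.0.1:9618>\n"
)
print(command_sockets(sample))
# An empty list means every daemon is listening on the NETWORK_INTERFACE address.
print(wrong_interface(sample, "172.16.0.1"))
```

In practice the text would be read from the MasterLog or CollectorLog and expected_ip would be the NETWORK_INTERFACE value.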
> If the Collector is running properly, condor_status will show a list of
> startd's that are reporting to it. If there are none, it won't report an
> error, it'll just show you that there are none. If you get an error, it
> means that either it's not configured to talk to the Collector at the
> right address, or that the Collector itself isn't listening at that
> address.
condor_status on the CM returns no output.
> If that happens, first verify that the collector is actually running (on
> the CM). You can use 'ps' for this. Also, the Master should log that it
> started the collector, and the collector itself will log to the
> CollectorLog.
The collector is running on the CM. Also, the logs show as you say.
> Hopefully, this should be enough to get you started.
Thank you, it's definitely pointed me in a good direction. I am guessing from
this that the CM is set up correctly. I should probably switch to tshooting the
clients. I am getting startd exiting with status 4 on the clients.