
Re: [condor-users] Troubleshooting process



On Tuesday 02 March 2004 6:00 pm, kge2@xxxxxxxx wrote:
> condor_status on a client returns nothing.  The clients only have
> condor_master running and nothing else.  condor_startd keeps exiting with
> result '4'. condor_config_val returned "MASTER,STARTD".  The demons are
> trying to connect to the server at 172.16.0.1.  This is correct.
>
> Maybe I have the firewall misconfigured.  I thought I just disabled
> everything firewall related, as this is a totally isolated network.
>
>
> Another error I've found is that the file "condor_starter.pvm" is missing.
> Where can I find that?

Let's back up and go through a couple of basics, ok?  In your earlier message, 
you talked about a non-execute machine, but here you're talking about the 
startd; this leads me to believe that there may be some confusion which I can 
hopefully help clear up.

1. An execute machine is, basically, any machine that runs a startd.  An execute 
machine can also be a submit machine and / or a "central manager".
2. Submit machines run the schedd.
3. Central managers run the collector and negotiator.  There is just one CM 
per pool.
4. A single machine can fill more than one of these roles.
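
To make the breakdown above concrete, here is a minimal Python sketch that maps 
a daemon list (like the "MASTER,STARTD" value that condor_config_val returned) 
to the roles that machine fills.  The function name and the role labels are 
made up for illustration; they are not part of Condor.

```python
def roles_from_daemon_list(daemon_list):
    """Map a comma- or space-separated daemon list to pool roles (illustrative)."""
    # Normalize "MASTER, STARTD" / "MASTER,STARTD" / "master startd" to a set.
    daemons = {d.strip().upper() for d in daemon_list.replace(",", " ").split()}
    roles = []
    if "STARTD" in daemons:
        roles.append("execute")          # execute machines run the startd
    if "SCHEDD" in daemons:
        roles.append("submit")           # submit machines run the schedd
    if {"COLLECTOR", "NEGOTIATOR"} <= daemons:
        roles.append("central manager")  # the CM runs collector and negotiator
    return roles
```

With the value from your clients, roles_from_daemon_list("MASTER,STARTD") 
yields only the execute role, which is consistent with a machine that is meant 
to run jobs but not to submit them.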

Above, you're talking about a startd, and state that the startd is exiting 
with status 4 (I assume that you're getting this information from the 
MasterLog).  You can start to diagnose what's going wrong with the startd by 
looking in the StartLog on that machine.  If there's not enough information, 
you may need to turn on more debugging by changing the STARTD_DEBUG setting in 
the machine's condor_config; try something like this:
STARTD_DEBUG        = D_COMMAND D_FULLDEBUG D_JOB

Now, let's go back to trying to get your CM working.  You'll most certainly 
want to set the NETWORK_INTERFACE parameter on this machine; set it to the IP 
address of the NIC that you want it to talk over.  For example, if eth0 is 
internal with an address of 192.168.1.10 and eth1 external and an address of 
128.105.2.6, and you wanted it to talk over eth0 (internal), you would have a 
line like this:
NETWORK_INTERFACE = 192.168.1.10

If you look in the MasterLog or CollectorLog of this machine, you should see 
(near the top) a line like:
3/2 21:59:06 DaemonCore: Command Socket at <192.168.1.10:1234>
Verify that it's listening on the right address.
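
If you'd rather not eyeball the log, the check can be scripted.  This is only 
a sketch in Python: it assumes the log line has the same shape as the example 
above, and the helper names (command_socket, listening_on) are made up for 
illustration.

```python
import re

# Matches lines shaped like the example above:
#   3/2 21:59:06 DaemonCore: Command Socket at <192.168.1.10:1234>
SOCKET_RE = re.compile(
    r"DaemonCore: Command Socket at <(\d{1,3}(?:\.\d{1,3}){3}):(\d+)>"
)

def command_socket(log_text):
    """Return (ip, port) from the first Command Socket line, or None."""
    match = SOCKET_RE.search(log_text)
    if match is None:
        return None
    return match.group(1), int(match.group(2))

def listening_on(log_text, expected_ip):
    """True if the logged command socket uses the expected IP address."""
    sock = command_socket(log_text)
    return sock is not None and sock[0] == expected_ip
```

You would read the MasterLog or CollectorLog into a string and pass it in, 
along with the address you set for NETWORK_INTERFACE.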

If the Collector is running properly, condor_status will show a list of 
startds that are reporting to it.  If there are none, it won't report an 
error; it'll just show you that there are none.  If you do get an error, it 
means that either the machine running condor_status isn't configured to talk 
to the Collector at the right address, or that the Collector itself isn't 
listening at that address.

If that happens, first verify that the collector is actually running (on the 
CM).  You can use 'ps' for this.  Also, the Master should log that it started 
the collector, and the collector itself will log to the CollectorLog.
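
The 'ps' check can also be scripted.  Here's a rough Python sketch; it assumes 
the collector shows up in the process list under the name condor_collector and 
that your 'ps' accepts the POSIX options -e and -o args.  Both function names 
are made up for illustration.

```python
import subprocess

def daemon_in_listing(ps_output, daemon="condor_collector"):
    """True if any line of a process listing mentions the daemon name."""
    # Plain substring match: it can also match unrelated commands that
    # merely mention the name (an editor open on a log file, for example).
    return any(daemon in line for line in ps_output.splitlines())

def collector_running():
    """Run ps and look for the collector (assumes a POSIX-style ps)."""
    # "-o args" prints the full command line, so long daemon names are
    # not cut short the way a bare command-name column can be.
    result = subprocess.run(
        ["ps", "-e", "-o", "args"],
        capture_output=True, text=True, check=True,
    )
    return daemon_in_listing(result.stdout)
```

Run collector_running() on the CM; if it comes back False, go look at the 
MasterLog to see whether the Master ever tried to start the collector.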

Hopefully, this should be enough to get you started.

Happy computing!

-Nick

-- 
           <<< Why, oh, why, didn't I take the blue pill? >>>
 /`-_    Nicholas R. LeRoy               The Condor Project
{     }/ http://www.cs.wisc.edu/~nleroy  http://www.cs.wisc.edu/condor
 \    /  nleroy@xxxxxxxxxxx              The University of Wisconsin
 |_*_|   608-265-5761                    Department of Computer Sciences