
Re: [Condor-users] condor installation



OK, I fixed my problem.
Looking at the TCP packets, I noticed that ICMP packets on ports 556 and 358 were being blocked.
I opened the ports on the manager server, and now it works.

To recap my problem: I had installed Condor on three servers.  All the servers worked locally, but when I ran condor_status, it reported only the manager.

I had also opened TCP port 9618 for communication with the manager.

Oh, and my systems are Linux servers.
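
For anyone hitting the same symptom, one quick check is whether the collector port is reachable at all from a submit/execute node. Below is a minimal sketch, not part of Condor: the function name can_connect and the hostname in the usage note are placeholders, and 9618 is the collector port mentioned above. It only tests TCP, so it will not reveal blocked UDP or ICMP traffic.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Running something like can_connect("central-manager.example.org", 9618) from each node (with your own manager's hostname) should show quickly whether a firewall is in the way.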

Thanks


On 12/26/06, Buhl, Marshall <Marshall_Buhl@xxxxxxxx> wrote:
Our site has been getting the same errors.  It seems to have started
after a reboot following the latest round of Microsoft security updates.
We tried setting up central management on a different server with a new
pool.  That worked for a short while.  Then, it broke again.

This is taking forever to debug because we usually can't get the Condor
service to restart (it refuses to stop).  We've had to reboot to get it
to use an updated condor_config.  Sometimes (not always) we can do a
"condor_restart" and after waiting quite a while, it finally says "Sent
"Restart" command to local master."  However, as Jeremy reported, I've
been getting the following from a condor_status:

====
CEDAR:6001:Failed to connect to <192.88.248.58:9618>
Error: Couldn't contact the condor_collector on WIND-WAS4.nrel.gov.

Extra Info: the condor_collector is a process that runs on the central
manager of your Condor pool and collects the status of all the machines and
jobs in the Condor pool. The condor_collector might not be running, it might
be refusing to communicate with you, there might be a network problem, or
there may be some other problem. Check with your system administrator to fix
this problem.

If you are the system administrator, check that the condor_collector is
running on WIND-WAS4.nrel.gov, check the HOSTALLOW configuration in your
condor_config, and check the MasterLog and CollectorLog files in your log
directory for possible clues as to why the condor_collector is not
responding. Also see the Troubleshooting section of the manual.
====

After a completely fresh installation on the central manager, I see this in
the MasterLog:

====
12/26 11:04:29 ** Log last touched time unavailable (No such file or directory)
12/26 11:04:29 ******************************************************
12/26 11:04:29 Using config source: C:\condor\condor_config
12/26 11:04:29 Using local config sources:
12/26 11:04:29    C:\condor/condor_config.local
12/26 11:04:30 DaemonCore: Command Socket at <192.88.248.58:1230>
12/26 11:14:30 WinFirewall: get_CurrentProfile failed: 0x800706d9
12/26 11:14:30 Started DaemonCore process "C:\condor/bin/condor_collector.exe", pid and pgroup = 1820
12/26 11:14:30 Started DaemonCore process "C:\condor/bin/condor_negotiator.exe", pid and pgroup = 2556
12/26 11:14:30 Started DaemonCore process "C:\condor/bin/condor_schedd.exe", pid and pgroup = 3008
12/26 11:14:30 Started DaemonCore process "C:\condor/bin/condor_startd.exe", pid and pgroup = 2964
12/26 11:14:30 DaemonCore: Command received via TCP from host <192.88.248.58:1232>
12/26 11:14:30 DaemonCore: received command 453 (RESTART), calling handler (admin_command_handler)
12/26 11:14:30 Sent signal 15 to COLLECTOR (pid 1820)
12/26 11:14:31 Sent signal 15 to NEGOTIATOR (pid 2556)
12/26 11:14:31 Sent signal 15 to SCHEDD (pid 3008)
12/26 11:14:39 Sent signal 15 to STARTD (pid 2964)
12/26 11:14:39 The COLLECTOR (pid 1820) exited with status 0
12/26 11:14:39 DaemonCore: Command received via UDP from host <192.88.248.58:1262>
12/26 11:14:39 DaemonCore: received command 60011 (DC_NOP), calling handler (handle_nop())
12/26 11:14:40 attempt to connect to <192.88.248.58:9618> failed: connect errno = 10061 connection refused.
12/26 11:14:40 ERROR: SECMAN:2003:TCP connection to <192.88.248.58:9618> failed
12/26 11:14:40 Failed to start non-blocking update to <192.88.248.58:9618>.
12/26 11:14:40 DaemonCore: Command received via UDP from host <192.88.248.58:1269>
12/26 11:14:40 DaemonCore: received command 60011 (DC_NOP), calling handler (handle_nop())
12/26 11:14:40 The SCHEDD (pid 3008) exited with status 0
12/26 11:14:40 DaemonCore: Command received via UDP from host <192.88.248.58:1270>
12/26 11:14:40 DaemonCore: received command 60011 (DC_NOP), calling handler (handle_nop())
12/26 11:14:40 The NEGOTIATOR (pid 2556) exited with status 0
12/26 11:14:44 DaemonCore: Command received via UDP from host <192.88.248.58:1277>
12/26 11:14:44 DaemonCore: received command 60011 (DC_NOP), calling handler (handle_nop())
12/26 11:14:44 The STARTD (pid 2964) exited with status 0
12/26 11:14:44 All daemons are gone.  Restarting.
12/26 11:14:44 Restarting master right away.
12/26 11:14:44 Running as NT Service = 1
12/26 11:14:44 Doing exec( "C:\WINDOWS\system32\cmd.exe /Q /C net stop Condor & net start Condor" )
12/26 11:14:44 ******************************************************
12/26 11:14:44 ** Condor (CONDOR_MASTER) STARTING UP
12/26 11:14:44 ** C:\condor\bin\condor_master.exe
12/26 11:14:44 ** $CondorVersion: 6.8.2 Oct 12 2006 $
12/26 11:14:44 ** $CondorPlatform: INTEL-WINNT50 $
12/26 11:14:44 ** PID = 528
12/26 11:14:44 ** Log last touched 12/26 11:14:44
12/26 11:14:44 ******************************************************
12/26 11:14:44 Using config source: C:\condor\condor_config
12/26 11:14:44 Using local config sources:
12/26 11:14:44    C:\condor/condor_config.local
12/26 11:14:44 DaemonCore: Command Socket at <192.88.248.58:1278>
====

I see from the seventh line of the log that there was an issue regarding
the Windows Firewall.  When I tried to examine the firewall settings, I
was told that "Windows Firewall cannot run because the Windows
Firewall/Internet Connection Sharing (ICS) service is not running."  I
thought that meant that the firewall was not interfering with Condor
operation; however, I told it to go ahead and start the service.  It did
and it gave me a chance to change the firewall settings.  I told it to
leave the firewall off.
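
Spotting lines like the WinFirewall failure by eye is tedious in a long MasterLog. The following is a rough sketch of a filter that pulls out suspicious entries from log text. The keyword list and the function name suspicious_entries are assumptions chosen for this example, not a Condor convention, and the timestamp pattern is simply the "MM/DD HH:MM:SS" prefix seen in the log above.

```python
import re

# Substrings treated as suspicious; this list is an assumption for the sketch.
SUSPECT_PATTERNS = ("failed", "ERROR", "refused")

# Matches the "MM/DD HH:MM:SS " timestamp that starts each log entry.
TIMESTAMP = re.compile(r"^\d{2}/\d{2} \d{2}:\d{2}:\d{2} ")


def suspicious_entries(log_text):
    """Return log entries that contain any of SUSPECT_PATTERNS.

    Lines without a leading timestamp are treated as continuations of the
    previous entry, which handles logs that were wrapped by a mail client.
    """
    entries = []
    for line in log_text.splitlines():
        if TIMESTAMP.match(line) or not entries:
            entries.append(line.strip())
        elif line.strip():
            entries[-1] += " " + line.strip()
    return [e for e in entries if any(p in e for p in SUSPECT_PATTERNS)]
```

Fed the log above, this would surface the WinFirewall line along with the refused connection to port 9618. A clean result does not prove the firewall is innocent; it only means none of the chosen keywords appeared.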

I tried a condor_restart after that and instead of waiting for a long
time to tell me it "Sent "Restart" command to local master," it did it
immediately.  This time, a condor_status worked!

I don't know if our problems are completely solved, but this may help
others get going again.


Marshall

Marshall L. Buhl Jr.
NREL/NWTC
Voice: +1 (303) 384-6914
Fax: +1 (303) 384-6901


-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Erik Paulson
Sent: Sunday, December 24, 2006 2:36 PM
To: Condor-Users Mail List
Subject: Re: [Condor-users] condor installation

On Fri, Dec 22, 2006 at 11:16:54PM -0500, Jeremy Villalobos wrote:
> Hello,
> I have installed Condor on three servers: one manager/submit/execute
> host and two submit/execute hosts.
> All the daemons work with no errors reported in the log files, but the
> submit/execute nodes only report the master node's CPUs when I type
> condor_status.
>
> The same is true if I type condor_status on the master.
>

First off, some condor terminology: what you're calling the master is
really the "Central Manager".

You likely have the wrong values for COLLECTOR_HOST/CONDOR_HOST in the
config files for your non-central-manager machines. (The default config
files have

COLLECTOR_HOST = $(CONDOR_HOST)

i.e., CONDOR_HOST is a macro.)
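
To illustrate why a wrong CONDOR_HOST carries over into COLLECTOR_HOST, here is a simplified sketch of $(NAME) macro substitution. It is not Condor's actual config parser: the function name expand, the depth limit, the choice to expand undefined macros to an empty string, and the example hostnames are all assumptions made for the illustration.

```python
import re

# Matches references of the form $(NAME).
MACRO = re.compile(r"\$\((\w+)\)")


def expand(name, settings, depth=10):
    """Expand $(NAME) references in settings[name], up to `depth` nested levels."""
    value = settings[name]
    for _ in range(depth):
        # Undefined macros become an empty string in this sketch.
        expanded = MACRO.sub(lambda m: settings.get(m.group(1), ""), value)
        if expanded == value:
            break
        value = expanded
    return value


settings = {
    "CONDOR_HOST": "central-manager.example.org",  # placeholder hostname
    "COLLECTOR_HOST": "$(CONDOR_HOST)",
}
print(expand("COLLECTOR_HOST", settings))
```

If CONDOR_HOST on a submit/execute node points at the node itself instead of the central manager, COLLECTOR_HOST follows it, and that node never reports to the real collector.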


The snippets you sent are hard to debug without knowing which machine
should be which.

To make this easier to debug, it would be helpful to know:
  the IPs of the central manager and the execute machines
  the full CollectorLog from the central manager
  a full StartLog or SchedLog from one of the other machines
  the value of 'condor_config_val COLLECTOR_HOST' on the execute machines
  the values of 'condor_config_val HOSTALLOW_WRITE' and
    'condor_config_val HOSTALLOW_READ' on the central manager
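
If it helps, the condor_config_val queries can be gathered in one go. This is only a sketch: it assumes condor_config_val is on the PATH, and the helper names run_config_val and collect_settings are made up for this example. It queries all three settings on whichever machine it runs on, so run it on each machine and keep the values that apply.

```python
import subprocess

# The settings requested above.
SETTINGS = ("COLLECTOR_HOST", "HOSTALLOW_WRITE", "HOSTALLOW_READ")


def run_config_val(name):
    """Run `condor_config_val NAME` and return its trimmed output."""
    result = subprocess.run(
        ["condor_config_val", name],
        capture_output=True,
        text=True,
        check=False,
    )
    # On failure (for example an undefined setting), report stderr instead.
    return (result.stdout if result.returncode == 0 else result.stderr).strip()


def collect_settings(names=SETTINGS, runner=run_config_val):
    """Return a {setting: value} dict; `runner` is injectable for testing."""
    return {name: runner(name) for name in names}
```

Calling collect_settings() and pasting the resulting dictionary into a reply covers the config values; the logs still have to be attached by hand. If condor_config_val is not installed, subprocess.run raises FileNotFoundError.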

-Erik

_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at either
https://lists.cs.wisc.edu/archive/condor-users/
http://www.opencondor.org/spaces/viewmailarchive.action?key=CONDOR
