
[Condor-users] schedd problem



I'm trying to set up a Condor cluster on RHEL5.3. The setup is a little
unusual and seems to be an integral part of the problem, so I will
describe it:
Nine compute machines (running STARTD, SCHEDD) exist on a private subnet
and cannot be reached (by design) from the rest of my organization. They
are connected to a switch which is also connected to the primary and
secondary central managers. The primary/secondary CMs have two NICs
each. Each CM has an IP on the private subnet and on my organization's
public network.

The problem:
I can't get jobs submitted from the public network to run. When I run
condor_q -analyze #.#, I get the familiar "Of 72 machines, ... 72
match but reject the job for unknown reasons". The odd thing is that
jobs submitted from the private network (i.e. from a compute machine) do
run. There wasn't anything very interesting in the logs on the CMs or
the compute machines, but I did find this in SchedLog on the public
network machine I submitted the job from:

7/8 15:53:56 (pid:24838) ******************************************************
7/8 15:53:56 (pid:24838) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
7/8 15:53:56 (pid:24838) ** /usr/sbin/condor_schedd
7/8 15:53:56 (pid:24838) ** SubsystemInfo: name=SCHEDD type=SCHEDD(5) class=DAEMON(1)
7/8 15:53:56 (pid:24838) ** Configuration: subsystem:SCHEDD local:<NONE> class:DAEMON
7/8 15:53:56 (pid:24838) ** $CondorVersion: 7.2.1 Jul  2 2009 BuildID: RH-7.2.2-0.9.el5 $
7/8 15:53:56 (pid:24838) ** $CondorPlatform: X86_64-LINUX_RHEL5 $
7/8 15:53:56 (pid:24838) ** PID = 24838
7/8 15:53:56 (pid:24838) ** Log last touched 7/8 15:53:54
7/8 15:53:56 (pid:24838) ******************************************************
7/8 15:53:56 (pid:24838) Using config source: /etc/condor/condor_config
7/8 15:53:56 (pid:24838) Using local config sources:
7/8 15:53:56 (pid:24838)    /var/lib/condor/condor_config.local
7/8 15:53:56 (pid:24838) DaemonCore: Command Socket at <130.207.197.122:40151>
7/8 15:53:56 (pid:24838) History file rotation is enabled.
7/8 15:53:56 (pid:24838)   Maximum history file size is: 20971520 bytes
7/8 15:53:56 (pid:24838)   Number of rotated history files is: 2
7/8 15:53:56 (pid:24838) "/usr/sbin/condor_shadow.std -classad" did not produce any output, ignoring
7/8 15:54:44 (pid:24838) Sent ad to central manager for jbrewer8@xxxxxxxxxxxxxxxxxxxxxxxx
7/8 15:54:44 (pid:24838) Sent ad to 2 collectors for jbrewer8@xxxxxxxxxxxxxxxxxxxxxxxx
7/8 15:54:44 (pid:24838) Called reschedule_negotiator()
7/8 15:59:44 (pid:24838) Sent ad to central manager for jbrewer8@xxxxxxxxxxxxxxxxxxxxxxxx
7/8 15:59:44 (pid:24838) Sent ad to 2 collectors for jbrewer8@xxxxxxxxxxxxxxxxxxxxxxxx
7/8 15:59:44 (pid:24838) Can't find address for startd jhb-579.stuff.gatech.edu
7/8 16:04:44 (pid:24838) Sent ad to central manager for jbrewer8@xxxxxxxxxxxxxxxxxxxxxxxx
7/8 16:04:44 (pid:24838) Sent ad to 2 collectors for jbrewer8@xxxxxxxxxxxxxxxxxxxxxxxx
7/8 16:04:47 (pid:24838) Can't find address for startd jhb-579.stuff.gatech.edu

The last three log entries repeat every five minutes.
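
To see whether the same startd is always the one the schedd can't
resolve, the repeated entries can be tallied per host. This is only a
minimal sketch in Python, assuming the SchedLog line format shown in the
excerpt above; the function name unresolved_startds is made up for
illustration and is not part of Condor.

```python
import re
from collections import Counter

# Timestamp/pid prefix as it appears in the SchedLog excerpt,
# e.g. "7/8 15:59:44 (pid:24838) ". Assumed from the excerpt only.
PREFIX = re.compile(r"^\d+/\d+ \d+:\d+:\d+ \(pid:\d+\)\s*")
MISSING = "Can't find address for startd"


def unresolved_startds(lines):
    """Count "Can't find address for startd <host>" entries per host.

    A line that lacks the timestamp prefix is treated as the wrapped
    continuation of the previous entry.
    """
    entries = []
    for raw in lines:
        line = raw.rstrip("\n")
        if PREFIX.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line.strip()

    counts = Counter()
    for entry in entries:
        message = PREFIX.sub("", entry)
        if message.startswith(MISSING):
            host = message[len(MISSING):].strip()
            if host:
                counts[host] += 1
    return counts
```

It can be fed an open file object for the SchedLog (wherever that lives
on the submit machine) or any list of lines; the result maps each
hostname to the number of times it was reported.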

Currently the CMs are running without firewalls, and their HOSTALLOW_*
macros are set to *.

Thanks.