
Re: [Condor-users] Getting started with 2 nodes



On Mon, Jun 11, 2012 at 05:03:45PM +0000, Rich Pieri wrote:
> What do the log files in /home/condor/log report?

Ah I see, thank you:
06/11/12 17:33:44 PERMISSION DENIED to unauthenticated@unmapped from host 10.0.1.2 for command 0 (UPDATE_STARTD_AD), access level ADVERTISE_STARTD: reason: ADVERTISE_STARTD authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 10.0.1.2,dev-storage2, hostname size = 1, original ip address = 10.0.1.2
06/11/12 17:33:45 PERMISSION DENIED to unauthenticated@unmapped from host 10.0.1.2 for command 0 (UPDATE_STARTD_AD), access level ADVERTISE_STARTD: reason: cached result for ADVERTISE_STARTD; see first case for the full reason

Rob de Graaf wrote:
> You probably need to set up authorisation so that storage2 can join storage1's
> pool. Look at the ALLOW_WRITE setting in storage1's condor_config.

Should that go in condor_config on all machines, or just condor_config.local
on the master? I'm guessing ALLOW_WRITE needs to list all the exec nodes
plus all the submit nodes?

For now I set ALLOW_WRITE=* in condor_config on the first machine, and that
solved the node visibility problem, thank you :-)
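
For the archives: presumably a tighter setting than * would be to list the
machines explicitly, e.g. (assuming both nodes are in the example.com
domain):

ALLOW_WRITE = dev-storage1.example.com, dev-storage2.example.com

but * was enough to get me going.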

Although the installation instructions at
http://research.cs.wisc.edu/condor/manual/v7.8/3_2Installation.html
do say "you might want to set up security for Condor", I think it would be
helpful if it said here that you need to change ALLOW_WRITE on the master.
Otherwise it just links to a highly complex section of the manual (3.6.1)
about setting up mutual authentication, encryption etc.

Matthew Farrellee wrote:
> http://spinningmatt.wordpress.com/2011/06/12/getting-started-creating-a-multiple-node-condor-pool/
> was written using Fedora, but hopefully isn't very different for
> ubuntu (use apt vs yum).

That is very useful, thank you.

The cluster I have here has 2 nodes each with 4 slots.

Now, I tried the job.sub submitting 8 jobs (but with 'sleep 60' rather than
'sleep 1d').  What happens is that all 8 jobs run only on the first node, 3
at a time: 3 run, then when those finish another 3 run, and finally the last
2.  The second node never picks anything up.
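
For reference, my job.sub was roughly this (a sketch, not an exact copy of
the blog's file):

universe   = vanilla
executable = /bin/sleep
arguments  = 60
log        = sleep.log
queue 8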

I tried following the instructions at
http://research.cs.wisc.edu/condor/manual/v7.8/2_6Managing_Job.html#SECTION00365000000000000000
(There appears to be a bug there: it says "condor_q -pool -analyze <job>"
but -pool needs a hostname argument)

    brian@dev-storage1:~$ condor_q -pool localhost -analyze 6.3


    -- Submitter: dev-storage1.example.com : <10.0.1.1:50852> : dev-storage1.example.com
    ---
    006.003:  Run analysis summary.  Of 8 machines,
          0 are rejected by your job's requirements 
          5 reject your job because of their own requirements 
          3 match but are serving users with a better priority in the pool 
          0 match but reject the job for unknown reasons 
          0 match but will not currently preempt their existing job 
          0 match but are currently offline 
          0 are available to run your job
            No successful match recorded.
            Last failed match: Tue Jun 12 10:08:02 2012

            Reason for last match failure: no match found

    The following attributes are missing from the job ClassAd:

    CheckpointPlatform

I didn't get any detail of what "their own requirements" were for rejecting
the job.
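
Side note: condor_q also has a -better-analyze option, which is supposed to
print more detail about the Requirements involved; I haven't verified
whether it would have shown the missing detail here:

    brian@dev-storage1:~$ condor_q -pool localhost -better-analyze 6.3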

Anyway, by guesswork I tried adding "ALLOW_WRITE = *" to dev-storage2 and
that seems to have fixed that problem (or maybe it was just the restart). 
Do execute nodes need an ALLOW_WRITE entry for the central manager?  Perhaps
the default should be something like:

ALLOW_WRITE = $(FULL_HOSTNAME), $(IP_ADDRESS), $(CONDOR_HOST)
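
To apply a change like that, I believe a condor_reconfig on the affected
node is enough (though I also restarted, so I can't be sure which did it),
and condor_status from the master shows whether both machines have joined:

    brian@dev-storage2:~$ sudo condor_reconfig
    brian@dev-storage1:~$ condor_status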

Finally, I note that one slot was in the "Owner" state and did not run any jobs.
I read through
http://research.cs.wisc.edu/condor/manual/v7.8/3_12Setting_Up.html#SECTION004128000000000000000
and (I think) fixed this using:
SLOTS_CONNECTED_TO_CONSOLE = 0
SLOTS_CONNECTED_TO_KEYBOARD = 0
on both nodes. (These are intended to be headless nodes)
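
An easy way to check is condor_status, which lists each slot's State; with
those settings no slot should sit in "Owner" while the machine is idle:

    brian@dev-storage1:~$ condor_status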

> If you find the instructions don't work, let me know and I'll see if I
> can update them.

Your instructions are good; the extra things I had to dig out above would
have been nice to have in there too :-)

One minor comment from this newbie: the official manual's contents page only
goes two levels deep.  As a result it can be pretty hard to find things
which I *know* I read before but can't remember where.  For example:

Why is the job not running?
http://research.cs.wisc.edu/condor/manual/v7.8/2_6Managing_Job.html#SECTION00365000000000000000

Slots on SMP machines:
http://research.cs.wisc.edu/condor/manual/v7.8/3_12Setting_Up.html#SECTION004128000000000000000

Dynamic slots:
http://research.cs.wisc.edu/condor/manual/v7.8/3_12Setting_Up.html#SECTION004128900000000000000

Thanks,

Brian.