
Re: [Condor-users] Flocking problem for 7.0.5->7.2.2 submission. New security issue?



Dan,

Bang on the money, well done. Setting BIND_ALL_INTERFACES=False got the flocked jobs working again.

Ta,
Mark

On Fri, Apr 24, 2009 at 4:33 PM, Dan Bradley <dan@xxxxxxxxxxxx> wrote:
Mark,

I'm guessing that you are not setting BIND_ALL_INTERFACES.

Starting in 7.1.1, BIND_ALL_INTERFACES is True by default.  This means
that setting NETWORK_INTERFACE without also setting
BIND_ALL_INTERFACES=False just has the effect of controlling which
interface Condor advertises, not which one it actually binds to (it
binds to all of them and will therefore use whichever one the OS chooses
in a particular case).

So I recommend setting BIND_ALL_INTERFACES=False and seeing if this
addresses your problem.
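
For the execute host in the example quoted below, the two settings together might look like the following minimal sketch. It assumes they live in that machine's condor_config.local and that 172.24.116.4 is the address Condor should use; adjust the address per machine.

```
# condor_config.local on the multi-homed execute host (illustrative sketch)

# Address Condor should use and advertise; taken from the example
# in this thread, substitute each machine's own private address.
NETWORK_INTERFACE = 172.24.116.4

# Bind only to NETWORK_INTERFACE instead of every interface.
# Defaults to True starting in 7.1.1, so it must be set explicitly.
BIND_ALL_INTERFACES = False
```

Running condor_config_val BIND_ALL_INTERFACES on the host should show which value is in effect. The daemons will probably need a restart, not just a reconfig, before they rebind their sockets, so check the manual for your version.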

--Dan

Mark Calleja wrote:
> Hi All,
>
> (Apologies if you receive multiple copies of this post. The
> camgrid-users mailing list appears to be blocking another of my email
> addresses.)
>
> We currently run several pools (all Linux) with v7.0.5 and are looking
> to upgrade piecemeal to v7.2.2. Encouraged by the entry in section 8.2
> of the v7.2.2 manual, namely "We believe that Condor 7.2.x and 7.0.x
> are wire-compatible, and can be freely mixed between computers in a
> Condor pool.", we've been testing upgrading some machines. However,
> we're seeing jobs getting rejected when the schedd is running 7.0.5
> and the startd is running 7.2.2. No other changes have been made, i.e.
> the configuration files have remained the same. Before I paste in the
> relevant parts of the log files, a bit of background: many of our
> machines have multiple IP addresses but Condor is forced to operate
> using a specific address, selected by the NETWORK_INTERFACE value in a
> machine's condor_config.local file. This address is always a "private"
> (RFC 1918) address in the range 172.24.xxx.xxx.
>
> Here's an example. The submit host has IP address 172.24.252.25 only,
> whereas the execute has two addresses: 131.111.xxx.xxx (which should
> *not* be used by Condor) and 172.24.116.4 (which should). So, here's
> the SchedLog from the submit host for when both submit and execute
> host are running 7.0.5 (job completes correctly):
>
> 4/20 17:45:08 Using config source: /etc/condor/condor_config
> 4/20 17:45:08 Using local config sources:
> 4/20 17:45:08    /usr/local/condor/local/condor_config.local
> 4/20 17:45:08    /usr/local/condor/local/condor_config.flocking
> 4/20 17:45:08 DaemonCore: Command Socket at <172.24.252.25:13743>
> 4/20 17:45:08 Initializing a VANILLA shadow for job 8.0
> 4/20 17:45:08 (8.0) (3799): Request to run on <172.24.116.4:9692> was ACCEPTED
> 4/20 17:45:09 (8.0) (3799): ZKM: setting default map to (null)
> 4/20 17:45:09 (8.0) (3799): Job 8.0 terminated: exited with status 0
> 4/20 17:45:09 (8.0) (3799): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
>
>
> Now the corresponding relevant snippet for when the execute host has
> been upgraded to 7.2.2 (job fails as file transfer does not take place):
>
> 4/18 06:19:52 Using config source: /etc/condor/condor_config
> 4/18 06:19:52 Using local config sources:
> 4/18 06:19:52    /usr/local/condor/local/condor_config.local
> 4/18 06:19:52    /usr/local/condor/local/condor_config.flocking
> 4/18 06:19:52 DaemonCore: Command Socket at <172.24.252.25:14228>
> 4/18 06:19:52 Initializing a VANILLA shadow for job 6.0
> 4/18 06:19:52 (6.0) (3719): Request to run on <172.24.116.4:9668> was ACCEPTED
> 4/18 06:19:52 (6.0) (3719): DaemonCore: PERMISSION DENIED to unknown user from host <131.111.xxx.xxx:9633> for command 61000 (FILETRANS_UPLOAD), access level WRITE
> 4/18 06:19:52 (6.0) (3719): ERROR "Error from starter on XXXX.escience.cam.ac.uk: Failed to transfer files" at line 649 in file pseudo_ops.C
>
> It would appear that in 7.2.2 Condor's trying to make use of an
> interface on the execute host that's not the one nominated in
> NETWORK_INTERFACE (in this case it's the canonical, globally routable
> address). Is there any reason why this has changed from 7.0.5? And is
> there any way of getting 7.2.2 to conform with the desired 7.0.5
> behaviour?
>
> Best regards,
> Mark
> ------------------------------------------------------------------------
>
> _______________________________________________
> Condor-users mailing list
> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/condor-users/
>