
Re: [Condor-users] Migrating 7.2->7.4 job submission woe



Mark,

Did you revert to this configuration _and_ revert to the 7.2.5 schedd? I'm confused now about what combinations of versions and configurations you have tried and what the result was.

Version 7.4.x should support all the same security-related configuration options that were supported in 7.2.x. The only change that was made was to convert the default condor_config file that is packaged with condor from HOSTALLOW/DENY to ALLOW/DENY. Both 7.2.x and 7.4.x should fully support both syntaxes.
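For anyone doing the same conversion by hand, the renaming described above (HOSTALLOW_* to ALLOW_*, HOSTDENY_* to DENY_*) can be sketched in a few lines of Python. This is only an illustration: the function name modernize is made up here, and it only rewrites the setting name on the left of the "=", not any $(HOSTALLOW_...) references on the right-hand side.

```python
import re

# Matches a legacy setting name at the start of a config line,
# e.g. "HOSTALLOW_NEGOTIATOR = $(CONDOR_HOST)".
LEGACY = re.compile(r"^(\s*)HOST(ALLOW|DENY)_(\w+)(\s*=.*)$")

def modernize(line):
    """Rename HOSTALLOW_X / HOSTDENY_X to ALLOW_X / DENY_X on one config line.

    Hypothetical helper: lines that do not define a legacy setting are
    returned unchanged, and macro references on the right are left alone.
    """
    match = LEGACY.match(line)
    if not match:
        return line
    indent, kind, level, rest = match.groups()
    return f"{indent}{kind}_{level}{rest}"

for original in [
    "HOSTALLOW_NEGOTIATOR = $(CONDOR_HOST)",
    "HOSTALLOW_NEGOTIATOR_SCHEDD = $(CONDOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS)",
    "SEC_DEFAULT_NEGOTIATION = NEVER",
]:
    print(modernize(original))
```

Since both syntaxes are supported, running something like this is optional; it is mainly useful for comparing an old and a new config side by side.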

--Dan

Mark Calleja wrote:
Hi Tim,

I've reverted to using the (deprecated) HOST[ALLOW|DENY] formalism, and have:

HOSTALLOW_NEGOTIATOR = $(CONDOR_HOST)
HOSTALLOW_NEGOTIATOR_SCHEDD = $(CONDOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS)

Mark

On 19/04/2010 17:54, Steven Timm wrote:
What is the value of ALLOW_NEGOTIATOR_SCHEDD? You
have to have it set to allow all the various negotiators
to which you could flock. Mine is set to $(COLLECTOR_HOST).

Steve


On Mon, 19 Apr 2010, Mark Calleja wrote:

So, I tried just swapping the condor_schedd binary on the submit host, using the 7.2.5 binary (but all else left as 7.4.2), and restarted that daemon. Sure enough, it all burst into life. That submit host can now hit execute hosts running 7.4.2 in the local pool as well as in a flocked pool (also running 7.4.2).

Has anyone else seen this problem?

Mark

On 19/04/2010 11:03, Mark Calleja wrote:
I'm afraid to say that I'm still seeing these errors for 7.4.2. Setting SCHEDD_DEBUG = D_FULLDEBUG doesn't cast that much extra light on what's happening:

04/19 10:51:46 Entered negotiate
04/19 10:51:46 Will use UDP to update collector tempo--escience.grid.private.cam.ac.uk <172.24.116.1:9618>
04/19 10:51:46 Trying to query collector <172.24.116.1:9618>
04/19 10:51:46 Unknown negotiator (172.24.116.1). Aborting negotiation.
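When the SchedLog is long, it can help to pull out just the addresses the schedd rejected. A minimal sketch, assuming the message wording is exactly as in the lines above; the function name unknown_negotiators is made up for illustration and is not part of Condor.

```python
import re

# Pattern for the schedd message quoted above; the wording is copied from the log.
UNKNOWN_NEGOTIATOR = re.compile(r"Unknown negotiator \((\d{1,3}(?:\.\d{1,3}){3})\)")

def unknown_negotiators(log_lines):
    """Return the distinct rejected negotiator IPs, in first-seen order."""
    seen = []
    for line in log_lines:
        match = UNKNOWN_NEGOTIATOR.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen

sample = [
    "04/19 10:51:46 Entered negotiate",
    "04/19 10:51:46 Trying to query collector <172.24.116.1:9618>",
    "04/19 10:51:46 Unknown negotiator (172.24.116.1). Aborting negotiation.",
]
print(unknown_negotiators(sample))  # ['172.24.116.1']
```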

I've also set

NEGOTIATOR_ADDRESS_FILE  = $(LOG)/.negotiator_address

on the central manager and performed a condor_reconfig, and ensured that SEC_DEFAULT_NEGOTIATION = NEVER across the pool (as it was in 7.2.5 and worked).

Did any default security settings change in the transition 7.2 -> 7.4? Or is there anything else I might be missing?

Mark

On 14/04/2010 10:21, Mark Calleja wrote:
Hi Dan,

It all looks sensible. So on the submit node running 7.4.2:

$ condor_config_val ALLOW_NEGOTIATOR
tempo--escience.grid.private.cam.ac.uk
$ condor_config_val ALLOW_NEGOTIATOR_SCHEDD
tempo--escience.grid.private.cam.ac.uk,

These are exactly the same values I get for HOSTALLOW_NEGOTIATOR and HOSTALLOW_NEGOTIATOR_SCHEDD on the successful submit node running 7.2.5.
From nslookup we see that this Negotiator is the very host that gets
mentioned in the 7.4.2 submit host's SchedLog as being the unknown Negotiator:

###
$ nslookup tempo--escience.grid.private.cam.ac.uk
Server:         131.111.8.42
Address:        131.111.8.42#53

Name:   tempo--escience.grid.private.cam.ac.uk
Address: 172.24.116.1
###
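The manual check above (does an entry in ALLOW_NEGOTIATOR_SCHEDD resolve to the IP named in the SchedLog?) can be sketched in Python. This is an illustration under stated assumptions, not how Condor itself evaluates the list: it handles plain hostnames and literal IPs only, with no wildcards or netmasks, and the helper names are made up. The resolver is passed in so the example runs without DNS; by default it would use socket.gethostbyname.

```python
import socket

def split_allow_list(value):
    """Split a comma-separated allow list, keeping empty entries visible."""
    return [entry.strip() for entry in value.split(",")]

def ip_is_allowed(ip, allow_value, resolve=socket.gethostbyname):
    """Return True if any non-empty entry is, or resolves to, the given IP."""
    for entry in split_allow_list(allow_value):
        if not entry:
            continue  # e.g. the empty entry left by a trailing comma
        if entry == ip:
            return True
        try:
            if resolve(entry) == ip:
                return True
        except OSError:
            continue  # unresolvable name; try the next entry
    return False

# Stand-in resolver using the nslookup result above, so no network is needed.
KNOWN = {"tempo--escience.grid.private.cam.ac.uk": "172.24.116.1"}

def fake_resolve(host):
    if host not in KNOWN:
        raise OSError(f"cannot resolve {host}")
    return KNOWN[host]

value = "tempo--escience.grid.private.cam.ac.uk,"
print(split_allow_list(value))
print(ip_is_allowed("172.24.116.1", value, resolve=fake_resolve))
```

Note that splitting the value shown above yields an empty second entry because of the trailing comma. Whether the 7.4.2 schedd treats that differently from 7.2.5 is not something this sketch can tell you.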

Any ideas? In the meantime I'll keep digging in case the penny drops.

Cheers,
Mark

ps. I'm running on 32-bit Debian Lenny machines, using the dynamically-linked x86 debian50 build from the Wisconsin repository.

On 13/04/2010 18:38, Dan Bradley wrote:
Mark,

Check the configuration of ALLOW_NEGOTIATOR and ALLOW_NEGOTIATOR_SCHEDD in the configuration of the submit machine.

Let me know if it still doesn't make sense.

--Dan

Mark Calleja wrote:
Hi,

I'm testing out v7.4.2 of Condor and have run into a job submission problem. Firstly, I should say that the pool is an upgraded 7.2.5 pool, with the same condor_config file but with HOSTALLOW/HOSTDENY entries changed to ALLOW/DENY, as recommended in the release notes for 7.4.0. A simple test job is submitted but never runs, even though "condor_q -better" shows that there are available resources. The NegotiatorLog has the relevant snippet:

04/13 16:33:49 Negotiating with xxxx@xxxxxxxxxxxxx at <172.24.116.7:9682>
04/13 16:33:49 0 seconds so far
04/13 16:33:49 condor_read() failed: recv() returned -1, errno = 104 Connection reset by peer, reading 5 bytes from schedd xxxx@xxxxxxxxxxxxxx
04/13 16:33:49 IO: Failed to read packet header
04/13 16:33:49     Failed to get reply from schedd

The corresponding entry in the submitter's SchedLog is:

04/13 16:33:36 (pid:19045) Sent ad to central manager for xxxx@xxxxxxxxxxxxx
04/13 16:33:36 (pid:19045) Sent ad to 1 collectors for xxxx@xxxxxxxxxxxxx
04/13 16:33:49 (pid:19045) Unknown negotiator (172.24.116.1). Aborting negotiation.

As can be surmised from the above, the submit host has IP address 172.24.116.7 and the central manager has 172.24.116.1. It looks like the Schedd doesn't trust the Negotiator, right? By comparison, when I submit the same job from a machine still running 7.2.5 to the same central manager, then the job runs just fine. That is:

submit host (7.4.2) -> central manager (7.4.2): Fails
submit host (7.2.5) -> central manager (7.4.2): Succeeds

Is there some new/extra configuration that needs to be carried out on a submit host running 7.4 compared to that on 7.2?

Cheers,
Mark

_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/