
Re: [HTCondor-users] Jobs remaining idle due to permission denied issue



Hello again,

I've made some progress but am still having issues. I believe I resolved some of the authentication problems by adding an ALLOW_WRITE setting to the common condor_config file.

Now, when I submit jobs, they are assigned to the node and begin executing. However, they crash immediately when they attempt to read a file from the submit machine's file system. I was under the impression that when a condor job is submitted and executed on another node, condor arranges things so that the job still sees the submit machine's directories. Do I need to configure something else to make this work?

The other issue I'm seeing is that when I run condor_reconfig -all or condor_restart -all, I get this:
ERROR
AUTHENTICATE:1003:Failed to authenticate with any method
AUTHENTICATE:1004:Failed to authenticate using GSI
GSI:5003:Failed to authenticate. Globus is reporting error (851968:50). There is probably a problem with your credentials. (Did you run grid-proxy-init?)
AUTHENTICATE:1004:Failed to authenticate using KERBEROS
AUTHENTICATE:1004:Failed to authenticate using FS
Can't send Reconfig command to master s01-012

********************************
Logs
********************************
On the central node:

SchedLog - Previous problems are resolved, but I still see this error which I forgot to include before:
my_popenv: Failed to exec in child, errno=2 (No such file or directory)
Failed to execute /usr/sbin/condor_shadow.std, ignoring

CollectorLog - No more errors

NegotiatorLog - No more errors

On execute node:

MasterLog - Says authentication is failing via GSI, KERBEROS and FS. For some reason, nothing is reported for PASSWORD, the authentication method I have set up:
02/04/20 20:08:39 authenticate_self_gss: acquiring self credentials failed. Please check your Condor configuration file if this is a server process. Or the user environment variable if this is a user process.
...
02/04/20 20:08:39 DC_AUTHENTICATE: required authentication of 141.212.115.83 failed: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using GSI|GSI:5003:Failed to authenticate. Globus is reporting error (851968:662). There is probably a problem with your credentials. (Did you run grid-proxy-init?)|AUTHENTICATE:1004:Failed to authenticate using KERBEROS|AUTHENTICATE:1004:Failed to authenticate using FS|FS:1004:Unable to lstat(/tmp/FS_XXXNxznHI)
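
The DC_AUTHENTICATE line above packs several errors into one pipe-delimited string, which makes it hard to see which methods were actually attempted. A minimal Python sketch for splitting it up, assuming each entry has the form SUBSYSTEM:CODE:message as it does in this log (the helper names `StackEntry` and `parse_error_stack` are made up for illustration and are not part of HTCondor):

```python
from typing import List, NamedTuple


class StackEntry(NamedTuple):
    subsystem: str
    code: int
    message: str


def parse_error_stack(stack: str) -> List[StackEntry]:
    """Split a pipe-delimited error stack into (subsystem, code, message) entries."""
    entries = []
    for part in stack.split("|"):
        # Split on the first two colons only, so colons inside the
        # message text (e.g. "(851968:662)") are left untouched.
        subsystem, code, message = part.split(":", 2)
        entries.append(StackEntry(subsystem.strip(), int(code), message.strip()))
    return entries


# Error stack copied from the MasterLog line above.
stack = (
    "AUTHENTICATE:1003:Failed to authenticate with any method|"
    "AUTHENTICATE:1004:Failed to authenticate using GSI|"
    "GSI:5003:Failed to authenticate. Globus is reporting error (851968:662). "
    "There is probably a problem with your credentials. "
    "(Did you run grid-proxy-init?)|"
    "AUTHENTICATE:1004:Failed to authenticate using KERBEROS|"
    "AUTHENTICATE:1004:Failed to authenticate using FS|"
    "FS:1004:Unable to lstat(/tmp/FS_XXXNxznHI)"
)

entries = parse_error_stack(stack)
for entry in entries:
    print(f"{entry.subsystem:<12} {entry.code}  {entry.message}")

# Collect the authentication methods that were reported as attempted.
prefix = "Failed to authenticate using "
tried = [e.message[len(prefix):] for e in entries if e.message.startswith(prefix)]
print("Methods attempted:", ", ".join(tried))
```

Read this way, the stack lists GSI, KERBEROS and FS as attempted methods and contains no entry for PASSWORD at all.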


********************************
Configuration
********************************

The common condor_config file now contains:

##  Where have you installed the bin, sbin and lib condor directories?
RELEASE_DIR = /usr

##  Where is the local condor directory for each host? This is where the local config file(s), logs and
##  spool/execute directories are located. this is the default for Linux and Unix systems.
LOCAL_DIR = /var

##  Where is the machine-specific local config file for each host?
LOCAL_CONFIG_FILE = /etc/condor/condor_config.local
##  If your configuration is on a shared file system, then this might be a better default
#LOCAL_CONFIG_FILE = $(RELEASE_DIR)/etc/$(HOSTNAME).local
##  If the local config file is not present, is it an error? (WARNING: This is a potential security issue.)
REQUIRE_LOCAL_CONFIG_FILE = false

##  The normal way to do configuration with RPMs is to read all of the
##  files in a given directory that don't match a regex as configuration files.
##  Config files are read in lexicographic order.
LOCAL_CONFIG_DIR = /etc/condor/config.d
#LOCAL_CONFIG_DIR_EXCLUDE_REGEXP = ^((\..*)|(.*~)|(#.*)|(.*\.rpmsave)|(.*\.rpmnew))$

##  Use a host-based security policy. By default CONDOR_HOST and the local machine will be allowed
use SECURITY : HOST_BASED
##  To expand your condor pool beyond a single host, set ALLOW_WRITE to match all of the hosts
ALLOW_WRITE = */*.eecs.umich.edu
##  FLOCK_FROM defines the machines that grant access to your pool via flocking. (i.e. these machines can join your pool).
#FLOCK_FROM =
##  FLOCK_TO defines the central managers that your schedd will advertise itself to (i.e. these pools will give matches to your schedd).
#FLOCK_TO = condor.cs.wisc.edu, cm.example.edu

##--------------------------------------------------------------------
## Values set by the debian patch script:
##--------------------------------------------------------------------

## For Unix machines, the path and file name of the file containing
## the pool password for password authentication.
#SEC_PASSWORD_FILE = $(LOCAL_DIR)/lib/condor/pool_password

##  Pathnames
RUN     = $(LOCAL_DIR)/run/condor
LOG     = $(LOCAL_DIR)/log/condor
LOCK    = $(LOCAL_DIR)/lock/condor
SPOOL   = $(LOCAL_DIR)/spool/condor
EXECUTE = $(LOCAL_DIR)/lib/condor/execute
CRED_STORE_DIR = $(LOCAL_DIR)/lib/condor/cred_dir
ETC     = /etc/condor
BIN     = $(RELEASE_DIR)/bin
LIB     = $(RELEASE_DIR)/lib/condor
INCLUDE = $(RELEASE_DIR)/include/condor
SBIN    = $(RELEASE_DIR)/sbin
LIBEXEC = $(RELEASE_DIR)/lib/condor/libexec
SHARE   = $(RELEASE_DIR)/share/condor
MAIL    = /usr/bin/mail
GANGLIA_LIB64_PATH = /lib,/usr/lib,/usr/local/lib

PROCD_ADDRESS = $(RUN)/procd_pipe

##  Install the minihtcondor package to run HTCondor on a single node


The security config file contains:

SEC_PASSWORD_FILE = /etc/condor/password.d/POOL
SEC_DAEMON_AUTHENTICATION = REQUIRED
SEC_DAEMON_INTEGRITY = REQUIRED
SEC_DAEMON_AUTHENTICATION_METHODS = PASSWORD
SEC_NEGOTIATOR_AUTHENTICATION = REQUIRED
SEC_NEGOTIATOR_INTEGRITY = REQUIRED
SEC_NEGOTIATOR_AUTHENTICATION_METHODS = PASSWORD
SEC_CLIENT_AUTHENTICATION_METHODS = FS, PASSWORD, KERBEROS, GSI
ALLOW_DAEMON = */*.<<< rest of hostname >>>, \
  */$(IP_ADDRESS)
ALLOW_NEGOTIATOR = */<<< hostname >>>

Thank you. Any help would be appreciated.
Jonathan Bailey

On Tue, Feb 4, 2020 at 12:45 PM Jonathan Bailey <jbaile@xxxxxxxxx> wrote:
Hello,

I currently have:
ALLOW_DAEMON = condor_pool@*/s01-*.<<< rest of hostname >>>, \
  condor@*/$(IP_ADDRESS)
ALLOW_NEGOTIATOR = condor_pool@*/<<<hostname >>>

I also tried
ALLOW_DAEMON = */*
ALLOW_NEGOTIATOR = */*

and deleting ALLOW_DAEMON and ALLOW_NEGOTIATOR entirely. However, the outcome was the same with these settings.

Thanks,
Jonathan

On Tue, Feb 4, 2020 at 8:16 AM Bockelman, Brian <BBockelman@xxxxxxxxxxxxx> wrote:
Hi Jonathon,

From the error messages, it looks like the authentication worked (i.e., right password) but the authorization was denied. What's in your various ALLOW_* and DENY_* configurations? Particularly, I suspect you want to double-check the value of ALLOW_DAEMON.

Brian

Sent from my iPhone

On Feb 3, 2020, at 7:02 PM, Jonathan Bailey <jbaile@xxxxxxxxx> wrote:

I am new to condor administration and am having trouble getting a new condor setup working. The system runs Ubuntu 18.04 and has one central node and many execute nodes, which have been set up following https://www-auth.cs.wisc.edu/lists/htcondor-users/2019-December/msg00000.shtml, including a security configuration identical (except for host names) to the one on slide 13 here: https://agenda.hep.wisc.edu/event/1325/session/16/contribution/41/material/slides/0.pdf. condor_status shows the expected execute nodes. However, when I submit jobs, they remain idle indefinitely.

On the central node, I have the following issues showing up in the logs:

SchedLog:
Can't find address for startd kremlin
SECMAN: FAILED: Received "DENIED" from server for user condor_pool@kremlin using method PASSWORD.
ERROR: SECMAN:2010:Received "DENIED" from server for user condor_pool@kremlin using method PASSWORD.
Failed to start non-blocking update to <<< ip address >>>.

CollectorLog:
PERMISSION DENIED to condor_pool@kremlin from host <<< ip address >>> for command 1 (UPDATE_SCHEDD_AD), access level ADVERTISE_SCHEDD: reason: cached result for ADVERTISE_SCHEDD; see first case for the full reason
DC_AUTHENTICATE: Command not authorized, done!

NegotiatorLog:
PERMISSION DENIED to condor_pool@kremlin from host <<< ip address >>> for command 421 (Reschedule), access level DAEMON: reason: DAEMON authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: <<< ip address >>>,<<< host name >>>, hostname size = 1, original ip address = <<< ip address >>>
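
The "no matching ALLOW entry" reason can be illustrated with a rough Python sketch of how a user/host entry might be compared against the authenticated identity and the host identifiers listed in the log. This is only an approximation using shell-style wildcards from the standard `fnmatch` module, not HTCondor's actual matching code; the IP address, the host names under example.edu, and the function names are placeholders invented for the example, since the real values are redacted in this thread:

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def entry_matches(entry: str, user: str, host_ids: Iterable[str]) -> bool:
    """Return True if a user/host ALLOW-style entry matches the request."""
    user_pat, sep, host_pat = entry.rpartition("/")
    if not sep:
        # Simplifying assumption of this sketch: an entry without "/"
        # is treated as a host pattern that applies to any user.
        user_pat = "*"
    if not fnmatchcase(user, user_pat):
        return False
    # The request matches if any identifier for the host (IP or name) matches.
    return any(fnmatchcase(h.lower(), host_pat.lower()) for h in host_ids)


def is_allowed(allow_list: List[str], user: str, host_ids: Iterable[str]) -> bool:
    """Return True if at least one entry in the list matches the request."""
    host_ids = list(host_ids)
    return any(entry_matches(entry, user, host_ids) for entry in allow_list)


user = "condor_pool@kremlin"                 # identity from the log above
host_ids = ["192.0.2.10", "cm.example.edu"]  # hypothetical IP and host name

narrow = ["condor_pool@*/s01-*.example.edu"]  # only hosts named s01-*
wide = ["*/*.example.edu"]                    # any user, any host in the domain

print(is_allowed(narrow, user, host_ids))
print(is_allowed(wide, user, host_ids))
```

Under these assumptions the narrow list rejects the request because neither host identifier matches the s01-* pattern, even though the user part matches, while the wide list accepts it.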

I have double-checked that the central node and execute node have the same password POOL. I have also tried disabling the authentication requirements set in the security config, but this only caused the execute node to disappear from condor_status's output (even after regenerating POOL and running condor_reconfig and/or restarting on both central and execute nodes).

Any help would be appreciated.

Thank you,
Jonathan Bailey

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/