
Re: [HTCondor-users] Permission problem



Dear Zach and Jaime,

Indeed, the ALLOW_NEGOTIATOR variable had a different domain. We fixed that and restarted Condor. We still see the PERMISSION DENIED message, although less frequently than before, and it is slightly different now:

05/01/15 18:50:42 PERMISSION DENIED to unauthenticated@unmapped from host 10.1.1.10 for command 442 (REQUEST_CLAIM), access level DAEMON: reason: DAEMON authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 10.1.1.10, hostname size = 0, original ip address = 10.1.1.10

We also replaced hostnames with IP addresses in the configuration file. We will keep monitoring and plan to upgrade Condor to 8.2.x.

Thanks for the help!


Angelo Fausti Neto
LIneA (www.linea.gov.br)

Skype: angelofausti
Cell phone: +55 51 81142801



On Fri, May 1, 2015 at 1:00 PM, Jaime Frey <jfrey@xxxxxxxxxxx> wrote:
On May 1, 2015, at 8:54 AM, Angelo Fausti Neto <angelofausti@xxxxxxxxx> wrote:

Dear all,

we are running
$CondorVersion: 7.8.5 Oct 09 2012 BuildID: 68720 $

on a CentOS 6.3 Rocks cluster
$CondorPlatform: x86_64_rhap_6.3 $

and we are facing permission problems that usually happen on one or two compute nodes and are difficult to reproduce. For now, the only way to avoid the problem is to restart Condor, but after a few job submissions the problem appears again.

The submission log shows errors like this:

/mnt/scratch/users/angelofausti/master_des/000010018999/condor/*.log

007 (38783.000.000) 04/30 20:06:21 Shadow exception!
        Error from slot1@xxxxxxxxxx: Failed to open '/mnt/scratch/users/angelofausti/master_des/000010018999/condor/skymap_skymap_1.11
1.out' as standard output: Permission denied (errno 13)
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
...
012 (38783.000.000) 04/30 20:06:21 Job was held.
        Error from slot1@xxxxxxxxxx: Failed to open '/mnt/scratch/users/angelofausti/master_des/000010018999/condor/skymap_skymap_1.11
1.out' as standard output: Permission denied (errno 13)
        Code 7 Subcode 13

When that happens, Condor executes the job as user and group nobody instead of the user who submitted the job, so the job does not have permission to write to the user's files.

On the compute node, the StartLog shows errors like this:

[angelofausti@nc02 ~]$ cat /var/opt/condor/log/StartLog | grep PERMISSION

04/30/15 23:44:15 PERMISSION DENIED to unauthenticated@unmapped from host 10.1.1.1 for command 440 (MATCH_INFO), access level NEGOTIATOR: reason: NEGOTIATOR authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 10.1.1.1,ferocks.local, hostname size = 1, original ip address = 10.1.1.1
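
To see at a glance which hosts, commands and access levels are being denied, the grep above could be replaced by a small summary script. This is only a rough sketch: the regular expression is written against the single log line quoted here, so other Condor versions may format the message differently, and the function name summarize_denials is made up for illustration.

```python
import re
from collections import Counter

# Pattern derived from the StartLog line quoted above; an assumption,
# not an official description of the log format.
DENIED = re.compile(
    r"PERMISSION DENIED to (?P<user>\S+) from host (?P<host>\S+) "
    r"for command (?P<num>\d+) \((?P<name>\w+)\), access level (?P<level>\w+)"
)


def summarize_denials(lines):
    """Count PERMISSION DENIED entries per (host, command name, access level)."""
    counts = Counter()
    for line in lines:
        match = DENIED.search(line)
        if match:
            counts[(match["host"], match["name"], match["level"])] += 1
    return counts


if __name__ == "__main__":
    # Path taken from the grep command above; adjust for your installation.
    with open("/var/opt/condor/log/StartLog") as log:
        for key, count in summarize_denials(log).most_common():
            print(count, *key)
```

Comparing the summary before and after a configuration change should show whether a given command or access level has stopped being denied.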

The decision of whether to run a job as the submitting user or the nobody user is based on the UID_DOMAIN configuration parameter of the submit and execute machines. With the usual configuration, in order to run the job as the submitting user, the value must be the same on the two machines and the value must be a substring of the submit machine's full hostname. Otherwise, the job is run as the nobody user.

The value of UID_DOMAIN won't change while a daemon is running, so it would be weird for an execute node to initially run jobs as the submitting user, and then start running them as user nobody (assuming the same submit machine is involved). One possibility is that after running for a while, the Condor startd starts getting a different result when determining the full hostname of the submit machine, such that it no longer matches the UID_DOMAIN value.
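
As a rough illustration of that rule (not Condor's actual code), the decision can be sketched in Python. The function name, its parameters and the UID_DOMAIN value "local" are hypothetical; the trust_uid_domain flag models the TRUST_UID_DOMAIN setting, on the assumption that enabling it skips the hostname check, and the plain string comparison is a simplification.

```python
def runs_as_submitting_user(submit_uid_domain, execute_uid_domain,
                            submit_full_hostname, trust_uid_domain=False):
    """Return True if the job would run as the submitting user.

    Simplified model of the rule described above:
      1. UID_DOMAIN must be the same on the submit and execute machines.
      2. UID_DOMAIN must be a substring of the submit machine's full
         hostname, unless TRUST_UID_DOMAIN is enabled (assumption).
    Otherwise the job runs as user nobody.
    """
    if submit_uid_domain != execute_uid_domain:
        return False
    if trust_uid_domain:
        return True
    return execute_uid_domain in submit_full_hostname


# Hostname resolves to a name containing the domain: submitting user.
print(runs_as_submitting_user("local", "local", "ferocks.local"))
# Reverse lookup failed, so only the IP address is known: nobody.
print(runs_as_submitting_user("local", "local", "10.1.1.1"))
```

The second call shows why a change in hostname resolution on the execute node could flip a machine from running jobs as the submitting user to running them as nobody, even though UID_DOMAIN itself never changed.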
If that is happening, you will see the following message in the StarterLog.* logs:

ERROR: the submitting host claims to be in our UidDomain (%s), yet its hostname (%s) does not match. If the above hostname is actually an IP address, Condor could not perform a reverse DNS lookup to convert the IP back into a name. To solve this problem, you can either correctly configure DNS to allow the reverse lookup, or you can enable TRUST_UID_DOMAIN in your condor configuration.
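
To check from the execute node whether the reverse lookup mentioned in that message currently works, something like the following could be run there. It is only a sketch using Python's standard socket module, not the resolution code Condor uses internally; the helper names and the UID_DOMAIN value "local" are hypothetical, and the resolver argument exists only so the lookup can be swapped out for testing.

```python
import socket


def reverse_lookup(ip, resolver=socket.gethostbyaddr):
    """Return the primary hostname for ip, or None if the lookup fails."""
    try:
        hostname, _aliases, _addresses = resolver(ip)
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError.
        return None
    return hostname


def hostname_matches_uid_domain(ip, uid_domain, resolver=socket.gethostbyaddr):
    """True if ip reverse-resolves to a hostname containing uid_domain."""
    hostname = reverse_lookup(ip, resolver)
    return hostname is not None and uid_domain in hostname


if __name__ == "__main__":
    # 10.1.1.1 is the submit host address from the StartLog line above;
    # "local" is an assumed UID_DOMAIN value.
    print(reverse_lookup("10.1.1.1"))
    print(hostname_matches_uid_domain("10.1.1.1", "local"))
```

Running this periodically on a node that shows the problem would tell you whether the lookup result changes over time.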

Do the PERMISSION DENIED errors only appear when the machine runs jobs as user nobody, or are they always there? These two errors should not be directly related. But a change in how hostnames are being resolved could connect them.

Thanks and regards,
Jaime Frey
UW-Madison HTCondor Project

