
Re: [HTCondor-users] Permission problem



On May 1, 2015, at 8:54 AM, Angelo Fausti Neto <angelofausti@xxxxxxxxx> wrote:

Dear all,

we are running
$CondorVersion: 7.8.5 Oct 09 2012 BuildID: 68720 $

on a CentOS 6.3 Rocks cluster
$CondorPlatform: x86_64_rhap_6.3 $

and we are facing permission problems that usually occur on one or two compute nodes and are difficult to reproduce. For now, the only way to clear the problem is to restart Condor, but after a few job submissions it appears again.

The submission log shows errors like this:

/mnt/scratch/users/angelofausti/master_des/000010018999/condor/*.log

007 (38783.000.000) 04/30 20:06:21 Shadow exception!
        Error from slot1@xxxxxxxxxx: Failed to open '/mnt/scratch/users/angelofausti/master_des/000010018999/condor/skymap_skymap_1.111.out' as standard output: Permission denied (errno 13)
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
...
012 (38783.000.000) 04/30 20:06:21 Job was held.
        Error from slot1@xxxxxxxxxx: Failed to open '/mnt/scratch/users/angelofausti/master_des/000010018999/condor/skymap_skymap_1.111.out' as standard output: Permission denied (errno 13)
        Code 7 Subcode 13

When that happens, Condor executes the job as user and group nobody instead of as the user who submitted the job, so the job does not have permission to write to the user's files.

On the compute node, the StartLog shows errors like this:

[angelofausti@nc02 ~]$ cat /var/opt/condor/log/StartLog | grep PERMISSION

04/30/15 23:44:15 PERMISSION DENIED to unauthenticated@unmapped from host 10.1.1.1 for command 440 (MATCH_INFO), access level NEGOTIATOR: reason: NEGOTIATOR authorization policy contains no matching ALLOW entry for this request; identifiers used for this        host: 10.1.1.1,ferocks.local, hostname size = 1, original ip address = 10.1.1.1

The decision of whether to run a job as the submitting user or the nobody user is based on the UID_DOMAIN configuration parameter of the submit and execute machines. With the usual configuration, in order to run the job as the submitting user, the value must be the same on the two machines and the value must be a substring of the submit machine's full hostname. Otherwise, the job is run as the nobody user.
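
The rule can be written out as a small Python sketch. This is only an illustration of the rule as described in this message, not HTCondor's actual code: the function name and its parameters are hypothetical, the case-insensitive comparison is an assumption, and the trust_uid_domain flag is assumed to stand in for the TRUST_UID_DOMAIN setting by skipping the hostname check.

```python
def runs_as_submitting_user(submit_uid_domain, execute_uid_domain,
                            submit_hostname, trust_uid_domain=False):
    """Return True if the job would run as the submitting user,
    False if it would fall back to the nobody user.

    Hypothetical helper that mirrors the rule described above.
    """
    # Assumption: compare case-insensitively, as DNS names are.
    submit_domain = submit_uid_domain.lower()
    execute_domain = execute_uid_domain.lower()

    # Condition 1: UID_DOMAIN must be the same on both machines.
    if submit_domain != execute_domain:
        return False

    # Assumption: with TRUST_UID_DOMAIN enabled, the hostname check is skipped.
    if trust_uid_domain:
        return True

    # Condition 2: UID_DOMAIN must be a substring of the submit machine's
    # full hostname as seen by the execute node. If reverse DNS fails,
    # submit_hostname is just an IP address and this check fails.
    return execute_domain in submit_hostname.lower()
```

For example, with the made-up domain "cluster.example" on both sides, a submit host seen as "submit.cluster.example" passes, while the same host seen only as "10.1.1.1" falls back to nobody unless trust_uid_domain is set.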

The value of UID_DOMAIN won't change while a daemon is running, so it would be weird for an execute node to initially run jobs as the submitting user, and then start running them as user nobody (assuming the same submit machine is involved). One possibility is that after running for a while, the Condor startd starts getting a different result when determining the full hostname of the submit machine, such that it no longer matches the UID_DOMAIN value.
If that is happening, you will see the following message in the StarterLog.* logs:

ERROR: the submitting host claims to be in our UidDomain (%s), yet its hostname (%s) does not match.  If the above hostname is actually an IP address, Condor could not perform a reverse DNS lookup to convert the IP back into a name.  To solve this problem, you can either correctly configure DNS to allow the reverse lookup, or you can enable TRUST_UID_DOMAIN in your condor configuration.
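
To check whether the reverse lookup is the part that changes over time, something like the following could be run on an affected execute node, once when jobs run correctly and again when they start running as nobody. It is a rough diagnostic sketch, not part of HTCondor: the function names are hypothetical, and it uses Python's standard socket.gethostbyaddr, which may not resolve names in exactly the same way the Condor daemons do.

```python
import socket


def reverse_lookup(ip_address):
    """Return the primary name reverse DNS reports for ip_address,
    or the IP address itself if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip_address)[0]
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError.
        return ip_address


def hostname_matches_domain(ip_address, uid_domain, resolver=reverse_lookup):
    """Resolve ip_address and report whether uid_domain is a substring
    of the resulting name. Returns a (matches, resolved_name) pair.

    The resolver argument exists only so the check can be exercised
    without real DNS; by default it performs a real reverse lookup.
    """
    name = resolver(ip_address)
    return uid_domain.lower() in name.lower(), name
```

Calling hostname_matches_domain with the submit machine's IP address and the configured UID_DOMAIN value, and comparing the resolved name between the two situations, would show whether the name seen for the submit machine has changed.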

Do the PERMISSION DENIED errors only appear when the machine runs jobs as user nobody, or are they always there? These two errors should not be directly related. But a change in how hostnames are being resolved could connect them.

Thanks and regards,
Jaime Frey
UW-Madison HTCondor Project