
Re: [HTCondor-users] condor 8.6.5



Thanks, everyone, for the emails. The two settings that ended up being critical were these, from /etc/condor/condor_config.local:


STARTER_ALLOW_RUNAS_OWNER = True 

TRUST_UID_DOMAIN = True


Nathan


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Michael Pelletier <Michael.V.Pelletier@xxxxxxxxxxxx>
Sent: Monday, August 7, 2017 5:16:07 PM
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] condor 8.6.5
 
Hey Nathan,

One thing that could be happening here (and forgive me if I missed this in an earlier e-mail) is that your config is not running the job under the UID of the job submitter. It may be running as "nobody", because STARTER_ALLOW_RUNAS_OWNER is false in the exec node's config or the RunAsOwner job attribute is false, or as some other user, because you're using per-slot users.
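A quick way to look at both settings is sketched below. It assumes the HTCondor command-line tools are on the PATH; the function name show_runas_settings is made up for illustration, and job 11.0 is one of the held jobs listed further down in this thread.

```shell
#!/bin/sh
# Sketch: print the settings that decide which user a job runs as.
# Assumes the HTCondor command-line tools are installed and on PATH.
# show_runas_settings is a hypothetical helper name, not an HTCondor tool.

show_runas_settings() {
    job_id="$1"
    if ! command -v condor_config_val >/dev/null 2>&1; then
        echo "HTCondor tools not found on PATH" >&2
        return 1
    fi
    # Exec-node side: may the starter run jobs as the submitting user?
    condor_config_val STARTER_ALLOW_RUNAS_OWNER
    # Job side: look for a RunAsOwner attribute in the job ClassAd.
    condor_q -l "$job_id" | grep -i '^RunAsOwner' \
        || echo "RunAsOwner not present in the ad for job $job_id"
}

# Example, using one of the held jobs from this thread:
#   show_runas_settings 11.0
```

Run the config check on the exec node and the condor_q check on the submit node; here both appear to be "pilgrim".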

The reason I suspect this is that by default HTCondor runs its "filechecks" before submitting the job to the schedd, to make sure that you have access where access is needed. Those checks are done on the machine where you're running condor_submit, rather than on the remote exec machine. The submit passed the checks and the job still got a permission error at run time, so there's some sort of inconsistency between the two.

One easy way to identify this without having to rummage through ClassAds would be to temporarily chmod 777 /home/nmoore/condor_sub, then condor_release one of the held jobs. Ideally you'd expect it to succeed in creating the job-X.out file without a permission-denied error, and then you'd be able to see the username or uid which created it. Don't forget to chmod 755 when you're done.
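As a rough sketch, that test could look like the following. The directory, job ID and output file name are taken from the condor_q output in this thread; the helper names owner_uid and try_release_with_open_dir are made up for illustration.

```shell
#!/bin/sh
# Sketch of the chmod-and-release test. Helper names are hypothetical.

# Print the numeric uid that owns a file (third field of "ls -n" output).
owner_uid() {
    ls -ldn "$1" | awk '{print $3}'
}

# Temporarily open up the directory and retry one held job.
try_release_with_open_dir() {
    dir="$1"
    job_id="$2"
    chmod 777 "$dir"            # let any user create files here, for now
    condor_release "$job_id"    # take the job off hold so it runs again
    echo "After the job starts, check the output file with owner_uid,"
    echo "then restore the directory: chmod 755 $dir"
}

# Example (not run automatically):
#   try_release_with_open_dir /home/nmoore/condor_sub 11.0
#   owner_uid /home/nmoore/condor_sub/job-3.out   # compare with: id -u nmoore
```

If the uid printed for job-3.out is not the one "id -u nmoore" reports, the job is running as some other user.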

If you still get a permission-denied error even with 777, then the next things to check are on "pilgrim" itself: log in and try "touch /home/nmoore/condor_sub/job-3.out" and see what it says. Perhaps the /home volume is mounted read-only due to some automounter or fstab typo? Or perhaps condor_sub is mode 555? If so, then your regular non-Condor login session would get a permission-denied error as well.
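Those manual checks can be bundled into one small function, sketched here. It assumes a Linux host; check_output_dir is a made-up name, and the findmnt step is skipped if that tool is not installed.

```shell
#!/bin/sh
# Sketch: check a job output directory from an ordinary login session.
# check_output_dir is a hypothetical helper name.

check_output_dir() {
    dir="$1"
    ls -ld "$dir"                       # directory mode and owner
    probe="$dir/.write-probe.$$"
    if touch "$probe" 2>/dev/null; then
        rm -f "$probe"
        echo "writable: $dir"
    else
        echo "NOT writable: $dir"
    fi
    # Mount options of the filesystem holding $dir; look for "ro".
    if command -v findmnt >/dev/null 2>&1; then
        findmnt -n -o TARGET,OPTIONS -T "$dir" \
            || echo "findmnt could not resolve $dir"
    fi
}

# Example, run on pilgrim as nmoore:
#   check_output_dir /home/nmoore/condor_sub
```

If this reports the directory as writable from a login session while the job still fails, the job is probably not running as nmoore.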

If you're running auditd, you can check the audit log to figure out who was making which system call when the failure occurred on the exec node.
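A sketch of one way to do that, assuming the audit userspace tools (auditctl, ausearch) are installed and you have root on the exec node. The function name and the key "condor_sub" are arbitrary labels chosen for this example.

```shell
#!/bin/sh
# Sketch: put an audit watch on the output directory so the failing
# access should show up in the audit log. Needs root; the helper name
# and the key are hypothetical labels.

watch_dir_with_auditd() {
    dir="$1"
    key="$2"
    if ! command -v auditctl >/dev/null 2>&1; then
        echo "auditctl not found; is the audit package installed?" >&2
        return 1
    fi
    # Watch the directory for writes and attribute changes, tagged with $key.
    auditctl -w "$dir" -p wa -k "$key"
    echo "Now condor_release a held job, then run as root:"
    echo "  ausearch -k $key -i"
    echo "Remove the watch afterwards with:"
    echo "  auditctl -W $dir -p wa -k $key"
}

# Example (as root on the exec node):
#   watch_dir_with_auditd /home/nmoore/condor_sub condor_sub
```

The uid and auid fields in the ausearch records identify the user that attempted the access.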

Good luck!

        -Michael Pelletier.



======================
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Moore, Nathan T
Sent: Monday, August 07, 2017 6:01 PM

The problem seems to persist:
[nmoore@pilgrim ~]$ condor_q -hold


-- Schedd: pilgrim : <199.17.158.20:9618?... @ 08/07/17 16:53:54
 ID      OWNER          HELD_SINCE  HOLD_REASON
  11.0   nmoore          8/7  12:04 Error from slot1@pilgrim: Failed to open '/home/nmoore/condor_sub/job-3.out' as standard output: Permission denied (errno 13)
  12.0   nmoore          8/7  12:04 Error from slot2@pilgrim: Failed to open '/home/nmoore/condor_sub/job-2.out' as standard output: Permission denied (errno 13)
  13.0   nmoore          8/7  12:04 Error from slot2@pilgrim: Failed to open '/home/nmoore/condor_sub/job.out' as standard output: Permission denied (errno 13)

3 jobs; 0 completed, 0 removed, 0 idle, 0 running, 3 held, 0 suspended

