
Re: [HTCondor-users] Inconsistent execute dir permissions



On 3/17/2016 10:34 AM, John Hover wrote:
Hi all,

We're having an issue and I'm wondering if you can provide guidance.


Starting with a few questions just to eliminate some quick possibilities -

Do you just have one partitionable slot per startd, or do you also have some additional static slots? Or more than one partitionable slot?

Do your execute machines have all the local accounts set up in /etc/passwd for every possible dynamic slot? E.g., on a 32-way machine, do you have user accounts slot1 thru slot32, or do some machines perhaps only have user accounts slot1 thru slot8?

Similar to the above, do you specify at least as many SLOT1_X_USER entries as your machine has CPU cores?

Are you having HTCondor run via glexec on your execute nodes?
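
The account and SLOT1_X_USER questions above could be checked on a node with something like the following. This is only a sketch, not an HTCondor tool: the function names and the slot_users mapping are hypothetical, and the mapping has to be filled in by hand from the node's SLOT1_<n>_USER settings. Note that pwd.getpwnam consults the system's name service, so it can also find accounts that are not listed in /etc/passwd itself.

```python
import pwd


def _local_account_exists(name):
    """Return True if the system can resolve this account name."""
    try:
        pwd.getpwnam(name)
        return True
    except KeyError:
        return False


def slot_account_gaps(num_cores, slot_users, account_exists=None):
    """Find gaps between core count, configured slot users, and accounts.

    slot_users maps a 1-based slot index to an account name, mirroring
    SLOT1_<n>_USER entries (hypothetical input, filled in by hand).
    Returns (unconfigured, missing):
      unconfigured: slot indexes up to num_cores with no configured user
      missing: configured account names that do not resolve on this host
    """
    if account_exists is None:
        account_exists = _local_account_exists
    unconfigured = [n for n in range(1, num_cores + 1) if n not in slot_users]
    missing = sorted({u for u in slot_users.values() if not account_exists(u)})
    return unconfigured, missing
```

For example, with slot1 thru slot4 configured on an 8-core machine, slot_account_gaps(8, {n: "slot%d" % n for n in range(1, 5)}) would list slots 5 thru 8 as unconfigured, plus any of the four names that have no account on the host.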

regards,
Todd


Setup is partitionable slots, as typical, with slot users:

DEDICATED_EXECUTE_ACCOUNT_REGEXP = slot.+
STARTER_ALLOW_RUNAS_OWNER = False
SLOT1_1_USER = slot1
SLOT1_2_USER = slot2
SLOT1_3_USER = slot3
SLOT1_4_USER = slot4
<etc>

But on the nodes, I see inconsistent execute directory ownership:
sometimes a mix of slot users and condor, other times all owned by condor.

I'm seeing job errors that are consistent with the job, running as the
slot user, failing to read those directories.

[root@ip-10-153-131-168 ~]# ls -alh /home/condor/execute
total 56K
drwxr-xr-x. 6 condor condor 4.0K Mar 16 22:08 .
drwxr-xr-x. 3 condor condor 4.0K Mar 10 13:09 ..
drwx------. 7 condor condor  12K Mar 16 21:49 dir_1043940
drwx------. 7 condor condor  12K Mar 16 22:06 dir_1062269
drwx------. 7 slot4  slot4   12K Mar 16 22:08 dir_1064108
drwx------. 7 slot3  slot3   12K Mar 16 22:09 dir_1064289

[root@ip-10-121-2-98 ~]# ls -alh /home/condor/execute/
total 56K
drwxr-xr-x. 6 condor condor 4.0K Mar 16 22:00 .
drwxr-xr-x. 3 condor condor 4.0K Mar 10 13:09 ..
drwx------. 7 condor condor  12K Mar 16 21:56 dir_1466019
drwx------. 7 condor condor  12K Mar 16 22:02 dir_1467286
drwx------. 7 condor condor  12K Mar 16 22:02 dir_1467287
drwx------. 7 condor condor  12K Mar 16 22:02 dir_1467288
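
To collect the same ownership information as the listings above across many nodes, a small script along these lines might help. It is a sketch under assumptions: the function names are hypothetical, the "dir_" prefix is taken from the listings, and the "slot" prefix is taken from the account names in the config above.

```python
import os
import pwd


def execute_dir_owners(execute_dir):
    """Return {directory name: owner name} for each dir_* subdirectory."""
    owners = {}
    with os.scandir(execute_dir) as entries:
        for entry in entries:
            if not entry.name.startswith("dir_"):
                continue
            if not entry.is_dir(follow_symlinks=False):
                continue
            uid = entry.stat(follow_symlinks=False).st_uid
            try:
                owners[entry.name] = pwd.getpwuid(uid).pw_name
            except KeyError:
                # No account resolves for this uid; report the number.
                owners[entry.name] = str(uid)
    return owners


def not_slot_owned(owners, slot_prefix="slot"):
    """Return the sorted directory names whose owner is not a slot account."""
    return sorted(
        name for name, owner in owners.items()
        if not owner.startswith(slot_prefix)
    )
```

Calling not_slot_owned(execute_dir_owners("/home/condor/execute")) on the first node shown above would be expected to return dir_1043940 and dir_1062269, the two directories owned by condor.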

Any idea how this would be happening? Log entries to look for? Ever seen
it before? Any config changes to try?

Thanks,

--john



--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685