
Re: [HTCondor-users] Write errors on secondary disk



Update with some more information:


So, I can submit jobs from herc1 and starscream and they write correctly and without issue. The only difference I can see is that the UID:GID for the condor account is unique on herc1/starscream, whereas on herc0 the condor account's GID was assigned to an existing group:


$ id condor

uid=987(condor) gid=987(chrony) groups=987(chrony)
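
A minimal sketch of how this kind of GID collision could be spotted across all accounts on a machine, assuming a Linux host and Python 3. The function name primary_group_mismatches is hypothetical. A mismatch is not an error in itself (many accounts legitimately share a primary group); it only flags cases like condor/chrony above for a closer look.

```python
import grp
import pwd


def primary_group_mismatches(passwd_entries, group_entries):
    """Return (user, gid, group_name) for every user whose primary GID
    resolves to a group that is not named after the user."""
    gid_to_name = {g.gr_gid: g.gr_name for g in group_entries}
    mismatches = []
    for p in passwd_entries:
        name = gid_to_name.get(p.pw_gid)
        if name is not None and name != p.pw_name:
            mismatches.append((p.pw_name, p.pw_gid, name))
    return mismatches


if __name__ == "__main__":
    # Read the local account databases (also covers NSS sources such as LDAP).
    for user, gid, group in primary_group_mismatches(pwd.getpwall(), grp.getgrall()):
        print(f"{user}: primary gid {gid} belongs to group '{group}'")
```

On herc0, before the change described next, this would be expected to list condor with gid 987 and group chrony.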


I ran groupmod -g 982 condor and usermod -g 982 condor to give the condor account a new GID:


$ id condor
uid=987(condor) gid=982(condor) groups=982(condor),1000(labuser)

and updated condor_config:


CONDOR_IDS = 987.982


but I'm still getting the same error. Is there some other permission that I'm missing or is this a red herring?
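
One way to narrow this down is to walk every directory on the way to the log file and apply the classic owner/group/other check for a given UID and set of GIDs. The following is a rough sketch under stated assumptions: the helper names can_access and first_blocker are hypothetical, and the check deliberately ignores ACLs, SELinux, root privileges and NFS ID squashing, any of which can also produce errno 13.

```python
import os


def can_access(st, uid, gids, want):
    """Classic permission-bit check against a stat result.
    `want` is a mask of 4 (read), 2 (write), 1 (execute/search)."""
    mode = st.st_mode
    if uid == st.st_uid:
        bits = (mode >> 6) & 7   # owner bits
    elif st.st_gid in gids:
        bits = (mode >> 3) & 7   # group bits
    else:
        bits = mode & 7          # other bits
    return (bits & want) == want


def first_blocker(path, uid, gids):
    """Return the first directory that would stop uid/gids from creating
    `path`, or None if the permission bits alone allow it."""
    parent = os.path.dirname(os.path.abspath(path))
    chain = []
    d = parent
    while True:                  # collect parent, grandparent, ..., /
        chain.append(d)
        up = os.path.dirname(d)
        if up == d:
            break
        d = up
    for d in reversed(chain):    # check from / downwards
        # Ancestors need search (x); the parent also needs write (w).
        want = 3 if d == parent else 1
        if not can_access(os.stat(d), uid, gids, want):
            return d
    return None


if __name__ == "__main__":
    # UID/GID values taken from the `id condor` output above.
    print(first_blocker("/local_data0/users/labuser/simple.log", 987, {982, 1000}))
```

If this prints None for both the condor account and the submitting user, the plain mode bits are probably not the cause and the layers the sketch ignores are worth checking next.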


Thanks,


Zach





From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of htcondor-users-request@xxxxxxxxxxx <htcondor-users-request@xxxxxxxxxxx>
Sent: Friday, September 9, 2016 7:11 PM
To: htcondor-users@xxxxxxxxxxx
Subject: HTCondor-users Digest, Vol 34, Issue 11
 

Today's Topics:

   1. Re: CREDMON errors and SEC_CREDENTIAL_DIRECTORY (Carles Acosta)
   2. Write errors on secondary disk (Hughes, Zachary)


----------------------------------------------------------------------

Message: 1
Date: Wed, 07 Sep 2016 10:53:34 +0200
From: Carles Acosta <cacosta@xxxxxx>
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] CREDMON errors and
        SEC_CREDENTIAL_DIRECTORY
Message-ID: <57CFD58E.5080004@xxxxxx>
Content-Type: text/plain; charset=windows-1252; format=flowed

Hi Iain,

Ok, thank you very much!!

Cheers,

Carles

On 09/07/2016 10:23 AM, Iain Bradford Steers wrote:
> Hi Carles,
>
> The SEC_CREDENTIAL stuff is for secure storage and forwarding of kerberos credentials for users.
>
> Just ignore it for now, remove the extra settings you've put in. I suspect the part of the schedd checking and logging those isn't
> checking whether they are being used or not.
>
> Cheers, Iain
>
> ________________________________________
> From: HTCondor-users [htcondor-users-bounces@xxxxxxxxxxx] on behalf of Carles Acosta [cacosta@xxxxxx]
> Sent: 07 September 2016 10:12
> To: htcondor-users@xxxxxxxxxxx
> Subject: [HTCondor-users] CREDMON errors and SEC_CREDENTIAL_DIRECTORY
>
> Dear all,
>
> I'm testing HTCondor-CE with HTCondor (8.5.5-1). Users submit jobs with
> their proxy, are mapped to a local user, and the jobs are routed by
> HTCondor-CE to the condor schedd. Everything is working fine. However, in
> the SchedLogs for condor-ce and condor, I see several warnings, for instance:
>
> ZKM: creating mark file for user dteam004
> CREDMON: ERROR: got mark_creds_for_sweeping but SEC_CREDENTIAL_DIRECTORY
> not defined!
>
> But, as I've said, everything is working.
>
> I've been searching the HTCondor manual and the HTCondor-CE wikis and it's
> not clear to me what SEC_CREDENTIAL_DIRECTORY is. I found somewhere
> that it's related to Kerberos usage, but I don't use Kerberos (the
> HTCondor pool is based on PASSWORD authentication).
>
> So, I added in htcondor and htcondor-ce configurations a path to
> SEC_CREDENTIAL_DIRECTORY and reloaded the daemons. Marks were created in
> credential directory:
>
> # ls -lrth /var/lib/condor/credential/
> total 0
> -rw------- 1 root root 0 Sep  6 17:35 dteam004.mark
>
> But then, after restarting the condor and condor-ce schedds, they
> started to fail:
>
> [...]
> 09/07/16 09:11:25 SCHEDD: User credentials not up-to-date.  Start-up
> delayed.  Waiting 10 seconds and trying 60 more times.
> 09/07/16 09:11:55 CREDMON: FAILURE: credmon never created
> /var/lib/condor-ce/credentials/CREDMON_COMPLETE after 20 seconds!
> [...]
>
> Thus, my question is what is the SEC_CREDENTIAL_DIRECTORY? Do I really
> need it?
>
> Cheers,
>
> Carles
>
> --
> Carles Acosta i Silva
> PIC (Port d'Informació Científica)
> Campus UAB, Edifici D
> E-08193 Bellaterra, Barcelona
> Tel: +34 93 581 33 22
> Fax: +34 93 581 41 10
> http://www.pic.es
> Avís - Aviso - Legal Notice: http://www.ifae.es/legal.html
>
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/
>


--
Carles Acosta i Silva
PIC (Port d'Informació Científica)
Campus UAB, Edifici D
E-08193 Bellaterra, Barcelona
Tel: +34 93 581 33 22
Fax: +34 93 581 41 10
http://www.pic.es
Avís - Aviso - Legal Notice: http://www.ifae.es/legal.html



------------------------------

Message: 2
Date: Sat, 10 Sep 2016 00:11:22 +0000
From: "Hughes, Zachary" <zdhughes@xxxxxxxxx>
To: "htcondor-users@xxxxxxxxxxx" <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Write errors on secondary disk
Message-ID:
        <MWHPR02MB2239A0F3200B825691D5B41FD3FA0@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx>
       
Content-Type: text/plain; charset="us-ascii"

So, an overview:


I have 3 machines in a condor cluster: herc0, herc1, and starscream. All of them mount home directories from a fourth, optimus. On each of the condor machines the home directories are mounted at /nfs/optimus/home/. herc0 has a secondary drive that is mounted locally as /local_data0 and is also mounted on all machines at /nfs/data_disks/herc0b.


Using the tutorial program, simple.c, I can successfully run the jobs from my home directory. All cores are used and all jobs write to disk. If I instead cd into a directory under /nfs/data_disks/herc0b, I get the following errors when submitting:


[zdhughes@herc0 zdhughes]$ condor_submit submit
Submitting job(s)..............................
30 job(s) submitted to cluster 81.

WARNING: File /nfs/data_disks/herc0b/users/zdhughes/simple.error is not writable by condor.

WARNING: File /nfs/data_disks/herc0b/users/zdhughes/simple.out is not writable by condor.


And the ShadowLog has:



09/09/16 18:50:42 ******************************************************
09/09/16 18:50:42 ** condor_shadow (CONDOR_SHADOW) STARTING UP
09/09/16 18:50:42 ** /usr/sbin/condor_shadow
09/09/16 18:50:42 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
09/09/16 18:50:42 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
09/09/16 18:50:42 ** $CondorVersion: 8.4.7 Jun 03 2016 BuildID: 369249 $
09/09/16 18:50:42 ** $CondorPlatform: x86_64_RedHat7 $
09/09/16 18:50:42 ** PID = 30540
09/09/16 18:50:42 ** Log last touched 9/9 18:47:43
09/09/16 18:50:42 ******************************************************
09/09/16 18:50:42 Using config source: /etc/condor/condor_config
09/09/16 18:50:42 Using local config sources:
09/09/16 18:50:42    /etc/condor/condor_config.local
09/09/16 18:50:42 config Macros = 71, Sorted = 71, StringBytes = 1828, TablesBytes = 1176
09/09/16 18:50:42 ******************************************************
09/09/16 18:50:42 CLASSAD_CACHING is OFF
09/09/16 18:50:42 ** condor_shadow (CONDOR_SHADOW) STARTING UP
09/09/16 18:50:42 ** /usr/sbin/condor_shadow
09/09/16 18:50:42 Daemon Log is logging: D_ALWAYS D_ERROR
09/09/16 18:50:42 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
09/09/16 18:50:42 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
09/09/16 18:50:42 ** $CondorVersion: 8.4.7 Jun 03 2016 BuildID: 369249 $
09/09/16 18:50:42 ** $CondorPlatform: x86_64_RedHat7 $
09/09/16 18:50:42 ** PID = 30541
09/09/16 18:50:42 ** Log last touched 9/9 18:50:42
09/09/16 18:50:42 ******************************************************
09/09/16 18:50:42 Using config source: /etc/condor/condor_config
09/09/16 18:50:42 Using local config sources:
09/09/16 18:50:42    /etc/condor/condor_config.local
09/09/16 18:50:42 config Macros = 71, Sorted = 71, StringBytes = 1828, TablesBytes = 1176
09/09/16 18:50:42 CLASSAD_CACHING is OFF
09/09/16 18:50:42 Daemon Log is logging: D_ALWAYS D_ERROR
09/09/16 18:50:42 Daemoncore: Listening at <0.0.0.0:19411> on TCP (ReliSock).
09/09/16 18:50:42 Daemoncore: Listening at <0.0.0.0:38254> on TCP (ReliSock).
09/09/16 18:50:42 DaemonCore: command socket at <10.0.7.10:19411?addrs=10.0.7.10-19411&noUDP>
09/09/16 18:50:42 DaemonCore: command socket at <10.0.7.10:38254?addrs=10.0.7.10-38254&noUDP>
09/09/16 18:50:42 DaemonCore: private command socket at <10.0.7.10:19411?addrs=10.0.7.10-19411>
09/09/16 18:50:42 DaemonCore: private command socket at <10.0.7.10:38254?addrs=10.0.7.10-38254>
09/09/16 18:50:42 Initializing a VANILLA shadow for job 81.1
09/09/16 18:50:42 Initializing a VANILLA shadow for job 81.0
09/09/16 18:50:42 (81.1) (30541): WriteUserLog::initialize: safe_open_wrapper("/nfs/data_disks/herc0b/users/zdhughes/simple.log") failed - errno 13 (Permission denied)
09/09/16 18:50:42 (81.1) (30541): WriteUserLog::initialize: failed to open file /nfs/data_disks/herc0b/users/zdhughes/simple.log
09/09/16 18:50:42 (81.1) (30541): Failed to initialize user log to /nfs/data_disks/herc0b/users/zdhughes/simple.log
09/09/16 18:50:42 (81.1) (30541): Job 81.1 going into Hold state (code 22,0): Failed to initialize user log to /nfs/data_disks/herc0b/users/zdhughes/simple.log
09/09/16 18:50:42 (81.1) (30541): RemoteResource::killStarter(): DCStartd object NULL!
09/09/16 18:50:42 (81.0) (30540): WriteUserLog::initialize: safe_open_wrapper("/nfs/data_disks/herc0b/users/zdhughes/simple.log") failed - errno 13 (Permission denied)
09/09/16 18:50:42 (81.0) (30540): WriteUserLog::initialize: failed to open file /nfs/data_disks/herc0b/users/zdhughes/simple.log
09/09/16 18:50:42 (81.0) (30540): Failed to initialize user log to /nfs/data_disks/herc0b/users/zdhughes/simple.log
09/09/16 18:50:42 (81.0) (30540): Job 81.0 going into Hold state (code 22,0): Failed to initialize user log to /nfs/data_disks/herc0b/users/zdhughes/simple.log
09/09/16 18:50:42 (81.0) (30540): RemoteResource::killStarter(): DCStartd object NULL!
09/09/16 18:50:42 ******************************************************


etc. for all 30 instances of the job. I have an entry for the disk in my exports file:


/local_data0 herc*.lexas(rw) starscream.lexas(rw)


and I can read/write to the disk as a user. I have run chmod 777 on the directory zdhughes, which is where the program is located and where the files are written. Going deeper into the directory structure, so that many parent directories also have full rwx access, changes nothing. Additionally, herc0 has a local account, labuser. When I submit the (vanilla) job from its home directory, the jobs on herc0 run normally (the jobs on other machines hold, as expected); but if I run the job locally from /local_data0/users/labuser/ I get the same thing:


09/09/16 19:06:20 (82.0) (31628): WriteUserLog::initialize: safe_open_wrapper("/local_data0/users/labuser/simple.log") failed - errno 13 (Permission denied)
09/09/16 19:06:20 (82.0) (31628): WriteUserLog::initialize: failed to open file /local_data0/users/labuser/simple.log
09/09/16 19:06:20 (82.0) (31628): Failed to initialize user log to /local_data0/users/labuser/simple.log
09/09/16 19:06:20 (82.0) (31628): Job 82.0 going into Hold state (code 22,0): Failed to initialize user log to /local_data0/users/labuser/simple.log
09/09/16 19:06:20 (82.0) (31628): RemoteResource::killStarter(): DCStartd object NULL!
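
With 30 jobs per cluster the ShadowLog gets noisy, so it can help to collapse the failures by path and errno. This is a small sketch only: the regular expression is derived from the log lines quoted above (HTCondor 8.4.7) and is an assumption about the format, not a documented interface, and the log location in the usage line is an assumed example.

```python
import re
from collections import Counter

# Matches lines such as:
#   ... (82.0) (31628): WriteUserLog::initialize:
#   safe_open_wrapper("/path/simple.log") failed - errno 13 (Permission denied)
OPEN_FAILURE = re.compile(
    r'\((?P<job>\d+\.\d+)\) \(\d+\): WriteUserLog::initialize: '
    r'safe_open_wrapper\("(?P<path>[^"]+)"\) failed - '
    r'errno (?P<errno>\d+) \((?P<reason>[^)]*)\)'
)


def summarize_open_failures(lines):
    """Count failed user-log opens, keyed by (path, errno, reason)."""
    counts = Counter()
    for line in lines:
        m = OPEN_FAILURE.search(line)
        if m:
            counts[(m["path"], int(m["errno"]), m["reason"])] += 1
    return counts


if __name__ == "__main__":
    import sys
    # Usage (path is an assumption): python summarize.py /var/log/condor/ShadowLog
    with open(sys.argv[1]) as fh:
        for (path, errno, reason), n in summarize_open_failures(fh).items():
            print(f"{n:4d}  errno {errno} ({reason})  {path}")
```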

Any ideas?



Thanks,


Zach