
Re: [HTCondor-users] HTCondor high availability


Currently jobs are being submitted and executed like they should, but they
don't complete because of the problems described in my last e-mail. I
checked every log and every folder for irregularities and compared the
running configuration to the declarations in the config files. While I
discovered that CONDOR_HOST had been set incorrectly because 00debconf
overwrote our own value, that didn't have anything to do with the problem
at hand.

Does anyone have an idea why we get these messages about permission errors,
even though there is no user "condor" in our cluster? The problem is really
annoying, since we have to remove jobs manually after checking their logs.

Kind regards
Christian Hennen

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On behalf of
Hennen, Christian
Sent: Monday, 19 October 2020 12:33
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] HTCondor high availability

Hi again,

Two other settings seem to be set to "non-standard" values, apart from the
allow list: LOCAL_DIR is set to /var/condor instead of /var, and CONDOR_IDS
is set to 1000.1000 (user r-admin).

While that worked in the previous iteration of the cluster, jobs now switch
back and forth between Running and Idle.
SchedLog says: "SetEffectiveOwner security violation: setting owner to
r-admin when active owner is "condor""
ShadowLog says: "SetEffectiveOwner(r-admin) failed with errno=13: Permission
denied"

How do I find out which folders/files are affected and maybe have the wrong
ownership or permissions?
There are no files owned by "condor" on the network share /clients (where
the spool and FS_REMOTE dirs now reside) and none relevant on the local hard
drive. The only files found by "find / -xdev -user condor" are the standard
condor directories under /var, which are all empty (due to LOCAL_DIR being
set to /var/condor).
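The checks above can be sketched roughly like this (the paths are only illustrative; adjust to your own layout):

```shell
# Ask HTCondor which directories it actually uses
condor_config_val LOCAL_DIR SPOOL FS_REMOTE_DIR

# Any files on the root filesystem still owned by a local "condor" user?
find / -xdev -user condor

# Inspect ownership/permissions of the directories in use
ls -ldn /var/condor /clients
```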

Kind regards
Christian Hennen

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On behalf of
Hennen, Christian
Sent: Wednesday, 14 October 2020 17:58
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] HTCondor high availability

Hi Todd,

Exactly. While security is obviously important and has nothing to do with
the HA setup itself, it was a surprise to me to have to configure security
for the communication between the masters. That's mainly because I
"inherited" this cluster and the original config contained * in the allow
list, so I had never experienced this type of issue. Securing the HTCondor
part of the cluster is now added to my list of planned security changes :)
For now, since the cluster is completely separated from the rest of the
network, working job processing and high availability of all services were
more of a priority.

Everything now works as expected and jobs can be submitted and started.
They change to Idle after a while, but maybe that's not related to the
HTCondor config.

Kind regards
Christian Hennen

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On behalf of Todd
L Miller
Sent: Friday, 9 October 2020 22:21
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] HTCondor high availability

> Do I need to configure any other authentication methods in addition to 
> all servers using LDAP via PAM ?

 	Yes, of course.  Security between different nodes has nothing to do
with how users log in.

> I tried to set the variable as you suggested, to no avail. Master2 now 
> says it can't connect to master1 ("Failed to fetch ads")

 	From your description, master1 is the original "master" node.  I
don't know if HAD will work for machines that are both submit nodes and
central managers, but for now let's assume that it will.  Note that HA
instructions do NOT address security at all; that's deliberate, because
security is complicated and nothing in HA changes anything about how your
security should work, except the addition of another server.  It's a bit
more of a surprise to you, perhaps, because you didn't separate your central
manager from your submit server (and thus FS worked for all your
client-to-daemon connections).

 	From your serverfault question, it looks like you basically don't
have any security at all -- your ALLOW lists include *, so the problem must
be in authentication, not authorization.
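For the later hardening, host-based authorization instead of * might look like this (a sketch only, with hypothetical hostnames; not a complete security config):

```
# Replace the wildcard allow lists with explicit hosts/domains
ALLOW_READ = *.cluster.example.org
ALLOW_WRITE = master1.cluster.example.org, master2.cluster.example.org, \
              *.cluster.example.org
ALLOW_ADMINISTRATOR = master1.cluster.example.org, master2.cluster.example.org
```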

 	Note that condor_q, by default in recent HTCondor versions, requires
authentication so that it only returns the jobs of the user who ran the
command.  Try running 'condor_q -allusers'; I think that will use a
different command that doesn't require authentication.

 	For this purpose, given that you know that the two masters share a
filesystem and user IDs, FS_REMOTE is not a bad choice.  You'll need to set
SEC_DEFAULT_AUTHENTICATION_METHODS on master1 and master2 to include FS and
FS_REMOTE; I would remove KERBEROS (since you're not using it).
Both master1 and master2 need to set FS_REMOTE_DIR to the same value.  Be
sure to restart HTCondor on both machines after you've done that (I can't
keep straight which configuration changes only require a reconfig).  Try
running condor_q again; it should work.  If it doesn't, try running


and we'll see what we can see.
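The steps above, as a minimal config sketch for both masters (the shared path is hypothetical; the method token in the HTCondor manual is FS_REMOTE, matching the FS_REMOTE_DIR knob):

```
# condor_config.local on master1 AND master2
SEC_DEFAULT_AUTHENTICATION_METHODS = FS, FS_REMOTE

# Must point at the same shared directory on both machines,
# writable by the users who submit jobs
FS_REMOTE_DIR = /clients/condor/fs_remote
```

followed by a full restart of HTCondor on both machines (e.g. condor_restart, or restarting the condor service).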

- ToddM
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe