[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[HTCondor-users] Submitting to a remote condor queue



Hello,

I would like to run some code inside a docker container, which submits jobs to a condor schedd running on the underlying host outside the container.

+--------------------------+
| +-----------+            |
| | container | ---.       |
| +-----------+    v       |
|                schedd    |
|                          |
+--------------------------+

The final goal here is to run some fairly complex code which writes out DAGs, and have that code bundled together with all its dependencies in a docker container.

The docker container includes the condor binaries (e.g. condor_submit, condor_submit_dag) but no running htcondor daemons. If necessary, it can have a tweaked condor_config, or I can provide command-line options as required to point at the schedd.

More generally: I'd like to understand how to configure a host A which contains only the condor binaries (and no running daemons) to submit jobs to a remote host B where the daemons are running. But I'll limit myself to the docker-container-on-same-host case for the moment.
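For concreteness, I imagine host A's condor_config would need something like the following (hostB.example.net is a placeholder, and I'm guessing at exactly which knobs matter):

CONDOR_HOST = hostB.example.net
COLLECTOR_HOST = hostB.example.net
SCHEDD_HOST = hostB.example.net
DAEMON_LIST =

With COLLECTOR_HOST and SCHEDD_HOST set, I'd hope the command-line tools wouldn't need -pool/-name at all, but I haven't verified that.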

In my test setup, the outer host is ardb-dummy.int.example.net / 192.168.5.192. The container has address 172.17.0.9, but this is NAT'd to 192.168.5.192 by the time it reaches the outer host (i.e. tcpdump on the outer host shows traffic from 192.168.5.192 to 192.168.5.192 on the "lo" interface)

Both are running htcondor 8.5.1 ubuntu packages (https://research.cs.wisc.edu/htcondor/ubuntu/)


Here's how far I've got:

(1) I have condor_status working: it just needs the "-pool" flag.

root@fe1d7a934cdb:/# condor_status -pool ardb-dummy.int.example.net
Name               OpSys      Arch   State     Activity LoadAv Mem  ActvtyTime

slot1@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.880 1497 0+00:58:48

                     Total Owner Claimed Unclaimed Matched Preempting Backfill

        X86_64/LINUX     1     0       0         1       0          0        0

               Total     1     0       0         1       0          0        0

brian@fe1d7a934cdb:~$ condor_status -pool ardb-dummy.int.example.net -any
MyType             TargetType         Name

Collector          None               Personal Condor at ardb-dummy.int.example
Scheduler          None               ardb-dummy.int.example.net
DaemonMaster       None               ardb-dummy.int.example.net
Negotiator         None               ardb-dummy.int.example.net
Machine            Job                slot1@xxxxxxxxxxxxxxxxxxxxxxxxxx


(2) I have condor_q working, but for some reason it needs both "-pool" and "-name" flags.

root@fe1d7a934cdb:/# condor_q -pool ardb-dummy.int.example.net -name ardb-dummy.int.example.net


-- Schedd: ardb-dummy.int.example.net : <192.168.5.192:50022>
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD

0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended

(If I give only "-pool", it appears to still be trying to talk to the local condor daemons, and failing.)
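Presumably setting the schedd in the container's condor_config would avoid the need for the "-name" flag — something like this (untested, names as above):

COLLECTOR_HOST = ardb-dummy.int.example.net
SCHEDD_HOST = ardb-dummy.int.example.net

but for now I've been passing the flags explicitly.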


(3) My problem now is getting condor_submit to work.

On the outer host, I have set FLOCK_FROM = 192.168.5.192. However when I try to submit from the container, I am getting authentication errors:

brian@fe1d7a934cdb:~$ condor_submit -pool ardb-dummy.int.example.net -name ardb-dummy.int.example.net sleep.sub
Submitting job(s)
ERROR: Failed to connect to queue manager ardb-dummy.int.example.net
AUTHENTICATE:1003:Failed to authenticate with any method
AUTHENTICATE:1004:Failed to authenticate using GSI
GSI:5003:Failed to authenticate. Globus is reporting error (851968:50). There is probably a problem with your credentials. (Did you run grid-proxy-init?)
AUTHENTICATE:1004:Failed to authenticate using KERBEROS
AUTHENTICATE:1004:Failed to authenticate using FS

Looking at logs on the outer host, /var/log/condor/SchedLog says:

02/08/16 13:03:53 (pid:4531) DC_AUTHENTICATE: authentication of <172.17.0.9:47024> did not result in a valid mapped user name, which is required for this command (1112 QMGMT_WRITE_CMD), so aborting.
02/08/16 13:03:53 (pid:4531) DC_AUTHENTICATE: reason for authentication failure: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using GSI|GSI:5003:Failed to authenticate. Globus is reporting error (851968:100). There is probably a problem with your credentials. (Did you run grid-proxy-init?)|AUTHENTICATE:1004:Failed to authenticate using KERBEROS|AUTHENTICATE:1004:Failed to authenticate using FS|FS:1004:Unable to lstat(/tmp/FS_XXXeDGUPh)

(Note: my username "brian" exists in both container and outer host, with the same uid and gid)

I have been trying to follow some documentation:

https://indico.cern.ch/event/272794/session/2/contribution/17/attachments/490442/677971/HTCondor-Security-Overview.pdf

but am getting a bit lost as to which knobs control authentication between CLI tools and daemons, and which between daemons and daemons; and which to set on the docker/client side, and which on the host/schedd side.

So, by following this post:
https://www-auth.cs.wisc.edu/lists/htcondor-users/2013-February/msg00129.shtml

inside the container I have set in /etc/condor/condor_config.local:

SEC_PASSWORD_FILE = /etc/condor/pool_password
SEC_CLIENT_AUTHENTICATION = PREFERRED
SEC_CLIENT_AUTHENTICATION_METHODS = PASSWORD

and in the outer host's condor_config.local:

FLOCK_FROM = 192.168.5.192
SEC_DEFAULT_AUTHENTICATION = OPTIONAL
SEC_DEFAULT_AUTHENTICATION_METHODS = PASSWORD
SEC_WRITE_AUTHENTICATION = REQUIRED
SEC_WRITE_AUTHENTICATION_METHODS = PASSWORD
SEC_PASSWORD_FILE = /etc/condor/pool_password

and on both: echo "xyzzy" >/etc/condor/pool_password
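(I believe the pool password file also has to be owned by root and readable only by root, and the manual describes creating it with condor_store_cred rather than echo — something like:

condor_store_cred -f /etc/condor/pool_password
chmod 600 /etc/condor/pool_password

I've checked the permissions, so I don't think that's my problem here.)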

This doesn't work. I get the following error on the client side:

brian@fe1d7a934cdb:~$ condor_submit -pool ardb-dummy.int.example.net -name ardb-dummy.int.example.net sleep.sub
Submitting job(s)
ERROR: Failed to connect to queue manager ardb-dummy.int.example.net
AUTHENTICATE:1003:Failed to authenticate with any method
AUTHENTICATE:1004:Failed to authenticate using PASSWORD

And on the server side in SchedLog:

02/08/16 13:22:19 (pid:25919) DC_AUTHENTICATE: authentication of <172.17.0.9:60837> did not result in a valid mapped user name, which is required for this command (1112 QMGMT_WRITE_CMD), so aborting.
02/08/16 13:22:19 (pid:25919) DC_AUTHENTICATE: reason for authentication failure: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using PASSWORD

Actually, the htcondor documentation says that password authentication is for daemon-to-daemon communication only:
http://research.cs.wisc.edu/htcondor/manual/latest/3_6Security.html#sec:Password-Authentication
so presumably this approach isn't going to work for condor_submit.

Next, looking at this document:
https://twiki.opensciencegrid.org/bin/view/CampusGrids/ConfiguringRemoteSubmissionHost
it says I also ought to set FLOCK_TO in the container, which I've now done (FLOCK_TO = 192.168.5.192), but that doesn't seem to make any difference.

Finally I tried changing the authentication to "PASSWORD,FS,CLAIMTOBE" on both sides, which gives the following:

brian@fe1d7a934cdb:~$ condor_submit -pool ardb-dummy.int.example.net -name ardb-dummy.int.example.net sleep.sub
Submitting job(s)
ERROR: Failed to connect to queue manager ardb-dummy.int.example.net
SECMAN:2010:Received "DENIED" from server for user brian using method CLAIMTOBE.
AUTHENTICATE:1004:Failed to authenticate using FS
AUTHENTICATE:1004:Failed to authenticate using PASSWORD

Server side:

root@ardb-dummy:~# grep -v 'Number of Active Workers' /var/log/condor/SchedLog | tail
...
02/08/16 13:33:02 (pid:27786) PERMISSION DENIED to brian from host 172.17.0.9 for command 1112 (QMGMT_WRITE_CMD), access level WRITE: reason: WRITE authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 172.17.0.9,172.17.0.9, hostname size = 1, original ip address = 172.17.0.9
02/08/16 13:33:02 (pid:27786) DC_AUTHENTICATE: Command not authorized, done!

Ah, that's new. So on the outer host I added the container network to the FLOCK_FROM range:

FLOCK_FROM = 192.168.5.192, 172.17.*
SEC_DEFAULT_AUTHENTICATION = OPTIONAL
SEC_DEFAULT_AUTHENTICATION_METHODS = PASSWORD,FS,CLAIMTOBE
SEC_WRITE_AUTHENTICATION = REQUIRED
SEC_WRITE_AUTHENTICATION_METHODS = PASSWORD,FS,CLAIMTOBE
SEC_PASSWORD_FILE = /etc/condor/pool_password

(although I've confirmed again with tcpdump that the traffic's source address is 192.168.5.192). This makes no difference.
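From the "PERMISSION DENIED ... no matching ALLOW entry" message above, I'm now guessing that what's actually needed on the outer host is an ALLOW_WRITE entry rather than (or as well as) FLOCK_FROM — something narrower than "*", perhaps along the lines of:

ALLOW_WRITE = $(ALLOW_WRITE), 192.168.5.192, 172.17.*

but I'm unsure whether hosts, IPs, or user@host identifiers are wanted here, and which of the two addresses the schedd actually matches against.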

In desperation, I set ALLOW_WRITE=* on the outer host. Then if I also use "-remote" instead of "-name" on the command line, I get something which works:

brian@fe1d7a934cdb:~$ condor_submit -pool ardb-dummy.int.example.net -remote ardb-dummy.int.example.net sleep.sub
Submitting job(s).
1 job(s) submitted to cluster 370721.

But this is most likely horrendously insecure. I believe it's relying on CLAIMTOBE (because if I remove CLAIMTOBE from the configs, it no longer works).

Does someone have any suggestion for how I should be doing this?

Thanks,

Brian Candler.

P.S. Docker containers can include arbitrary usernames/UIDs, so it would probably be best if all jobs submitted by these containers were mapped to a single userID in htcondor land. I'm sure I remember reading about a way to do that kind of mapping, but I can't find it now.

P.P.S. I realise that the DAG itself and the submit files used by that DAG also need to be visible on the host where dagman is running. I expect I'll use a docker volume to expose some bit of host filesystem for this purpose. But first I want to be clear on the right way to submit simple jobs remotely.