Re: [HTCondor-users] condor_ssh_to_job to a flocked job

On 8/9/2022 9:50 AM, Matthias Schnepf wrote:


Hi all,
We have two HTCondor pools and flock jobs from one cluster to the other. The submit node runs 9.1.2, while the worker nodes we flock to run 9.0.13. I am trying to use condor_ssh_to_job on a running flocked job in the other pool. The jobs run inside a Docker container as user nobody.
When I use condor_ssh_to_job as root on the submit machine, it works fine and I end up inside the Docker container, regardless of who submitted the job.
When an ordinary user tries to ssh into a flocked job, they get "Failed to connect to starter" after a while. condor_ssh_to_job works fine within the cluster where the job was submitted.

I looked at the StarterLog (see below), and it seems to get stuck for ordinary users. After "Created security session for job owner", the starter keeps querying Docker regularly but does nothing else. When root runs condor_ssh_to_job, HTCondor runs a "docker exec -it ..." after "Created security session for job owner".

Could this be a problem with authentication? I did not find any security message in the logs that looks problematic.

Best regards,

Hi Matthias,

Given the information you provided above, especially the clue that it works fine if you run condor_ssh_to_job as root, I have a good guess about what is happening here.  I am also guessing that your submit machine has firewall rules set up to deny incoming connections on ephemeral ports, and that you do not want to change those rules. If so, you can probably get condor_ssh_to_job to work for regular users just as it does now for root by running the following chmod command on your submit machine:
    sudo chmod 1777 `condor_config_val DAEMON_SOCKET_DIR`
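
If you want to check the directory before and after the change, something like the sketch below may help. It only tests whether a directory has mode 1777 (world-writable with the sticky bit, like /tmp). The helper name is_mode_1777 is made up for this example, and it assumes GNU stat (the -c option) and that condor_config_val is on your PATH.

```shell
# Hypothetical helper (not part of HTCondor): succeeds only if the given
# directory has mode 1777. Assumes GNU coreutils stat, which supports -c.
is_mode_1777() {
    dir="$1"
    mode=$(stat -c '%a' "$dir") || return 2
    [ "$mode" = "1777" ]
}

# Only run the check where HTCondor's tools are actually installed.
if command -v condor_config_val >/dev/null 2>&1; then
    sock_dir=$(condor_config_val DAEMON_SOCKET_DIR)
    if is_mode_1777 "$sock_dir"; then
        echo "$sock_dir already has mode 1777"
    else
        echo "$sock_dir does not have mode 1777; the chmod above would change that"
    fi
fi
```

This does not change anything on the machine; it only reports the current state, so it is safe to run as an ordinary user.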

Take a look at the documentation in the Manual for the config knob DAEMON_SOCKET_DIR for an explanation of why this works.

Feel free to follow up with any questions.

Hope this helps,