
Re: [HTCondor-users] Remote cluster test failed when using condor_remote_cluster command



Hi Jaime,
First, let's focus on the LSF cluster problem.
The attached lsf_ping.sh patch was probably edited on a Windows system, so it contains inappropriate `\r` (carriage-return) line-ending characters. You may want to check this script in HTCondor's GitHub repo :). I replaced them with Unix line endings on my computer, and I am sure the modified script was copied to the remote cluster correctly (I checked /work/cse-liyf/bosco/glite/libexec/lsf_ping.sh on the LSF cluster). However, the problem is not solved. The GridManager log still shows `08/24/23 15:21:39 [2126584] resource cse-liyf@xxxxxxxxxxxx is still down`.
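For reference, here is a minimal, self-contained sketch of the check and fix I used (a temp file stands in for the real script):

```shell
# Demo of detecting and stripping Windows CRLF line endings, the same
# fix I applied to the patched lsf_ping.sh (using a temp file here).
f=$(mktemp)
printf '#!/bin/sh\r\necho ok\r\n' > "$f"   # simulate a CRLF-damaged script

grep -c $'\r' "$f"                  # prints 2: both lines end in \r
sed -i 's/\r$//' "$f"               # strip the \r (same effect as dos2unix)
! grep -q $'\r' "$f" && echo clean  # prints "clean": no \r left
rm -f "$f"
```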

Here is my understanding so far.
`bhist -a` is definitely necessary: plain `bhist` produces no output here. But I don't know which program or script invokes lsf_ping.sh, so I can't tell whether there is a path error when it is executed, or how to check whether it loads `` `dirname $0`/blah_load_config.sh `` correctly. Can you provide more information about how lsf_ping.sh is executed (invoked by whom, and when)? Maybe we can fix it together!

Let's set aside the Slurm cluster's port problem for a moment and focus on LSF. Finally, sincere thanks for your patience and help.

Yifei Li



 
 
 
------------------ Original ------------------
Date:  Wed, Aug 23, 2023 10:50 PM
To:  "htcondor-users"<htcondor-users@xxxxxxxxxxx>;
Cc:  "Jaime Frey"<jfrey@xxxxxxxxxxx>;
Subject:  Re: [HTCondor-users] Remote cluster test failed when using condor_remote_cluster command
 
I suspect your issue on the new LSF cluster is a bug we recently fixed in how HTCondor detects whether the LSF scheduler is operational. I've attached a patched lsf_ping.sh script, which should replace the one in /usr/libexec/blahp/.

Your struggles exposed how poorly we handle systems where ssh is listening on an alternate port. I've fixed our code so that you can specify 'cse12232396@xxxxxxxxxxxx:10022' everywhere, or specify just '172.18.34.19' and set the username and port via ~/.ssh/config. I've attached updated copies of condor_remote_cluster (to be placed in /usr/bin) and remote_gahp (to be placed in /usr/sbin).
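For the second option, an ~/.ssh/config entry along these lines (host, user, and port taken from earlier in this thread) should let the plain IP work everywhere:

```text
# ~/.ssh/config
Host 172.18.34.19
    User cse12232396
    Port 10022
```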

Let me know if these updated files get things working for you.

- Jaime

On Aug 22, 2023, at 8:38 PM, Yifei Li <12232396@xxxxxxxxxxxxxxxxxxx> wrote:

Hi Jaime,
Thank you so much for explaining how HTCondor connects to the cluster login node!
I realized that my Slurm cluster (172.18.34.19, port 10022) only opens a few ports for security. That is why I can't run the remote_gahp command successfully. I will get in touch with our administrator to solve this port problem.

Then I tried it on another LSF cluster (cse-liyf, 172.18.6.178), which has port 22 and other ports open. I can run the remote_gahp command successfully now.
The remote_gahp connection log is shown below.
(base) liyifei@ubuntu:~$ remote_gahp cse-liyf@xxxxxxxxxxxx condor_ft-gahp
Agent pid 1970883
CONDOR_INHERIT not defined, using bogus value 127.0.0.1:12345 for gridmanager address
Allocated port 45176 for remote forward to
$GahpVersion 2.0.1 Jul 30 2012 Condor_FT_GAHP $
...

However, when I test `condor_remote_cluster -t cse-liyf@xxxxxxxxxxxx`, it still does not work. It reports that the resource is still down, and the job in condor_q stays idle. How can I check whether the resource is up or down?
***GridManagerLog***
08/23/23 09:22:42 [1957663] resource cse-liyf@xxxxxxxxxxxx is still down
08/23/23 09:23:52 [1957663] Found job 13.0 --- inserting
08/23/23 09:23:52 [1957663] (13.0) doEvaluateState called: gmState GM_INIT, remoteState 0

Yifei Li

 
 
 
------------------ Original ------------------
From:  "Jaime Frey via HTCondor-users"<htcondor-users@xxxxxxxxxxx>;
Date:  Tue, Aug 22, 2023 11:59 PM
To:  "htcondor-users"<htcondor-users@xxxxxxxxxxx>;
Cc:  "Jaime Frey"<jfrey@xxxxxxxxxxx>;
Subject:  Re: [HTCondor-users] Remote cluster test failed when using condor_remote_cluster command
 
The remote_gahp command (used by condor_remote_cluster --test and when submitting a job with condor_submit) adds '-p 22' to the ssh command it runs. This overrides the setting in ~/.ssh/config, but not a port number given as part of the hostname. I think we need to remove that in a future release.

- Jaime

On Aug 22, 2023, at 10:50 AM, Jaime Frey <jfrey@xxxxxxxxxxx> wrote:

It looks like some of the commands that condor_remote_cluster runs for setup (like scp) don't work when specifying the port as part of the hostname.

It looks like you're now running into a different problem. Try running this on the command line:

remote_gahp cse12232396@xxxxxxxxxxxx:10022 condor_ft-gahp

It is the same command that HTCondor uses to connect to the cluster login node and run a helper tool.
When it's working, you'll see something like this:

% remote_gahp hpclogin3 condor_ft-gahp
Agent pid 2361257
CONDOR_INHERIT not defined, using bogus value 127.0.0.1:12345 for gridmanager address
Allocated port 48820 for remote forward to
$GahpVersion 2.0.1 Jul 30 2012 Condor_FT_GAHP $

You can hit Ctrl-C or type 'quit' to stop the helper tool.

 - Jaime

On Aug 22, 2023, at 9:07 AM, Yifei Li <12232396@xxxxxxxxxxxxxxxxxxx> wrote:

I tried to modify the condor submit script from `grid_resource = batch slurm cse12232396@xxxxxxxxxxxx` to `grid_resource = batch slurm cse12232396@xxxxxxxxxxxx:10022`. Now there is a new error: `Failed to read GAHP server version`. Any thoughts? Looking forward to your reply!

*****log******
08/22/23 21:54:36 DaemonCore: command socket at <127.0.0.1:9618?addrs=127.0.0.1-9618&alias=ubuntu&noUDP&sock=gridmanager_1737236_ab6d_4>
08/22/23 21:54:36 DaemonCore: private command socket at <127.0.0.1:9618?addrs=127.0.0.1-9618&alias=ubuntu&noUDP&sock=gridmanager_1737236_ab6d_4>
08/22/23 21:54:39 [1943860] Found job 10.0 --- inserting
08/22/23 21:54:39 [1943860] GAHP server pid = 1943863
08/22/23 21:54:41 [1943860] GAHP server pid = 1943877
08/22/23 21:54:42 [1943860] Failed to read GAHP server version
08/22/23 21:54:42 [1943860] Error starting 172.18.34.19:10022 transfer GAHP: Agent pid 1943869\n
08/22/23 21:54:42 [1943860] resource cse12232396@xxxxxxxxxxxx:10022 is now down
08/22/23 21:54:42 [1943860] (10.0) doEvaluateState called: gmState GM_INIT, remoteState 0
08/22/23 21:54:42 [1943860] Gahp Server (pid=1943877) exited with status 255 unexpectedly
08/22/23 21:54:44 [1943860] (10.0) doEvaluateState called: gmState GM_CLEAR_REQUEST, remoteStat




 
 
 
------------------ Original ------------------
From:  "Yifei Li"<12232396@xxxxxxxxxxxxxxxxxxx>;
Date:  Tue, Aug 22, 2023 09:54 PM
To:  "htcondor-users"<htcondor-users@xxxxxxxxxxx>;
Subject:  Re: [HTCondor-users]Remote cluster test failed when using condor_remote_cluster command
 
If we modify ~/.ssh/config, the ssh client on Linux can use the specified port directly, which is what `condor_remote_cluster -a` relies on. However, condor_submit may not use the ssh client to dispatch jobs, so modifying ~/.ssh/config does not affect `condor_remote_cluster -t`. Is there any way to specify the port for `condor_submit`? It would be helpful.

Yifei Li



 
------------------ Original ------------------
From:  "Yifei Li"<12232396@xxxxxxxxxxxxxxxxxxx>;
Date:  Tue, Aug 22, 2023 09:39 PM
To:  "htcondor-users"<htcondor-users@xxxxxxxxxxx>;
Subject:  Re: [HTCondor-users]Remote cluster test failed when using condor_remote_cluster command
 
Thanks for your reply! However, this command did not work for me. I am reading the source code of condor_remote_cluster. If you have any other suggestions about how to specify the port, please let me know.

(base) liyifei@ubuntu:~$ condor_remote_cluster -a cse12232396@xxxxxxxxxxxx:10022 slurm
Enter the password to copy the ssh keys to cse12232396@xxxxxxxxxxxx:10022:
ssh: Could not resolve hostname 172.18.34.19:10022: Name or service not known

Yifei Li
 
 
------------------ Original ------------------
From:  "Jaime Frey via HTCondor-users"<htcondor-users@xxxxxxxxxxx>;
Date:  Tue, Aug 22, 2023 09:22 PM
To:  "htcondor-users"<htcondor-users@xxxxxxxxxxx>;
Cc:  "Jaime Frey"<jfrey@xxxxxxxxxxx>;
Subject:  Re: [HTCondor-users] Remote cluster test failed when using condor_remote_cluster command
 
I don't know why the alternate port number in ~/.ssh/config would work with --add but not --test. You can include the alternate port number in the hostname, like so:

condor_remote_cluster -a cse12232396@xxxxxxxxxxxx:12345 slurm
condor_remote_cluster -a cse12232396@xxxxxxxxxxxx:12345

grid_resource = batch slurm cse12232396@xxxxxxxxxxxx:12345

- Jaime

> On Aug 22, 2023, at 6:50 AM, Yifei Li <12232396@xxxxxxxxxxxxxxxxxxx> wrote:
>
> Thank you so much!!
> I have installed the remote cluster successfully. However, our remote cluster's ssh port is not 22. How can I set the ssh port for the remote cluster? I have added the remote ssh info to ~/.ssh/config, which means I can use the ssh command without specifying a port. However, it does not work with condor_remote_cluster -t (it works with condor_remote_cluster --add). Here is the log showing the failed test.
>
> (base) liyifei@ubuntu:~$ condor_remote_cluster -t cse12232396@xxxxxxxxxxxx
> Testing ssh to cse12232396@xxxxxxxxxxxxxxxxxxxxx!
> Testing remote submission...Passed!
> Submission and log files for this job are in /home/liyifei/bosco-test/boscotest.aYtWz
> Waiting for jobmanager to accept job...Passed
> Checking for submission to remote slurm cluster (could take ~30 seconds)...Failed
> Showing last 5 lines of logs:
> 08/22/23 19:39:27 [1938276] Error starting 172.18.34.19 GAHP: Agent pid 1938292\nssh: connect to host 172.18.34.19 port 22: Connection refused\nAgent pid 1938292 killed\n
> 08/22/23 19:39:27 [1938276] resource cse12232396@xxxxxxxxxxxx is now down
> 08/22/23 19:39:27 [1938276] (6.0) doEvaluateState called: gmState GM_INIT, remoteState 0
> 08/22/23 19:39:27 [1938276] Gahp Server (pid=1938286) exited with status 255 unexpectedly
> 08/22/23 19:39:31 [1938276] (6.0) doEvaluateState called: gmState GM_CLEAR_REQUEST, remoteState 0
>
> Yifei Li
>
>
>
>
>    ------------------ Original ------------------
> From:  "Tim Theisen"<tim@xxxxxxxxxxx>;
> Date:  Tue, Aug 22, 2023 07:26 PM
> To:  "htcondor-users"<htcondor-users@xxxxxxxxxxx>; "Yifei Li"<12232396@xxxxxxxxxxxxxxxxxxx>; Subject:  Re: [HTCondor-users] Remote cluster test failed when using condor_remote_cluster command
>  When I checked this morning, the file server is back online.
> ...Tim
> On 8/21/23 22:19, Tim Theisen via HTCondor-users wrote:
>> I have confirmed that file server is currently not available. I will report back when it is operational.
>> ...Tim
>> On 8/21/23 20:45, Yifei Li wrote:
>>> Thanks for your reply!
>>> I am trying to use condor_remote_cluster under a regular account. But it seems there is a network error while downloading the installation file. Is the file server down? I downloaded it successfully several days ago. Could you check it for me? Thank you!
>>>
>>> ***Log***
>>> liyifei@ubuntu:~$ condor_remote_cluster --add cse12232396@1**** slurm
>>> Enter the password to copy the ssh keys to cse12232396@xxxxxxxxxxxx:
>>> Downloading release build for cse12232396@****..............................................................................................................................curl: (28) Failed to connect to research.cs.wisc.edu port 443: Connection timed out
>>> Failure
>>> Failed to download release build.
>>> Unable to download and prepare files for remote installation.
>>> Download URL: https://research.cs.wisc.edu/htcondor/tarball/10.x/10.7.0/release/condor-10.7.0-x86_64_AlmaLinux8-stripped.tar.gz
>>> Aborting installation to cse12232396@***.
>>>
>>> Yifei Li
>>>
>>>
>>>
>>>
>>>       ------------------ Original ------------------
>>> From:  "Jaime Frey via HTCondor-users"<htcondor-users@xxxxxxxxxxx>;
>>> Date:  Tue, Aug 22, 2023 05:24 AM
>>> To:  "htcondor-users"<htcondor-users@xxxxxxxxxxx>;
>>> Cc:  "Jaime Frey"<jfrey@xxxxxxxxxxx>;
>>> Subject:  Re: [HTCondor-users] Remote cluster test failed when using condor_remote_cluster command
>>>   The condor_remote_cluster command has to be run under the regular user account under which you will be submitting your workflow jobs. You don't run it as the root user.
>>>
>>> You can use condor_remote_cluster to access two different clusters simultaneously for your workflows. One thing to keep in mind is that each submit file must name the cluster that that job should be run on, like so:
>>>
>>> grid_resource = batch slurm cluster1.foo.edu
>>>
>>> If you're using DAGMan, you can use the VARS command to set the cluster to use for a whole set of nodes in the DAG.
>>>
>>>  - Jaime
>>>
>>>> On Aug 19, 2023, at 2:22 AM, Yifei Li <12232396@xxxxxxxxxxxxxxxxxxx> wrote:
>>>>
>>>> Dear HTCondor development Team,
>>>>     I have access to two campus clusters: one is LSF-based, the other is Slurm-based. Since I am not an administrator of these clusters but still want to use both of them to execute one workflow simultaneously, I think I can use condor_remote_cluster to achieve my goal. First question: can I use the two clusters via HTCondor to execute a workflow simultaneously?
>>>>     So far, I have made some progress toward this goal. I installed HTCondor (MiniCondor) on my PC workstation, which is on the same local-area network as the campus clusters. I used the condor_remote_cluster command to add the LSF cluster and the Slurm cluster. I added them successfully, and they appear in the remote cluster list. However, when I test with the `condor_remote_cluster -t` command, the task cannot be dispatched to the remote cluster; it sits idle in condor_q.
>>>> Could you provide some suggestions to help me set up my environment? Is it possible to achieve my goals without root access to the clusters? Looking forward to your reply.
>>>>
>>>> ****Log from my PC workstation****
>>>> root@ubuntu:~/bosco-test/boscotest.p3SGb# condor_remote_cluster -t cse-liyf@xxxxxxxxxxxx
>>>> Testing ssh to cse-liyf@xxxxxxxxxxxxxxxxxxxxx!
>>>> Testing remote submission...Passed!
>>>> Submission and log files for this job are in /root/bosco-test/boscotest.2DBlK
>>>> Waiting for jobmanager to accept job...Passed
>>>> Checking for submission to remote lsf cluster (could take ~30 seconds)...grep: /root/bosco-test/boscotest.2DBlK/logfile: No such file or directory
>>>> grep: /root/bosco-test/boscotest.2DBlK/logfile: No such file or directory
>>>> grep: /root/bosco-test/boscotest.2DBlK/logfile: No such file or directory
>>>> grep: /root/bosco-test/boscotest.2DBlK/logfile: No such file or directory
>>>> grep: /root/bosco-test/boscotest.2DBlK/logfile: No such file or directory
>>>> Then failed.
>>>>
>>>>
>>>> Yifei Li