
Re: [HTCondor-users] single machine(slot) vanilla universe mpirun issue



Hi Max,

It's strange that orte cares about the plm_rsh_agent when running on a single machine. A couple thoughts:

1. Try blanking the plm_rsh_agent parameter: --mca plm_rsh_agent ""
2. Sometimes the PATH inside a condor job is missing or set improperly (maybe that's why it can't find ssh). Could you try wrapping your executable in a script that does an "export PATH" before running mpirun, as in the sketch after this list? You might even want to have your job run "env" and check your output file to see what kind of environment it's running in.
3. If your pool isn't set up to use a shared filesystem, condor might be transferring your mpirun executable. If your access point is at all different from your execution point, you might consider adding "transfer_executable = false" to your submit file to make sure you're using the mpirun executable on the execution point (there's a submit-file sketch below as well).
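
For (1) and (2), a wrapper along these lines would do it. This is just a sketch with made-up names (mpi_wrapper.sh, the PATH value), so adjust it for your own setup; I'm assuming the transferred "main" lands at the top of the job's scratch directory, which is where transfer_input_files normally puts it:

#!/bin/sh
# mpi_wrapper.sh -- example wrapper; the names and paths here are placeholders
# Make sure the system binaries (and your Open MPI install) are on PATH
export PATH=/usr/bin:/usr/local/bin:$PATH
# Dump the environment into the job's output file so you can see what
# the job is actually running with
env
# Run the execution point's mpirun, blanking plm_rsh_agent since everything
# stays inside this single slot
exec /usr/bin/mpirun --mca plm_rsh_agent "" -np 4 ./main 0 1 500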

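For (3), a submit file like this (again, just a sketch) transfers the small wrapper with the job but always uses the mpirun that's installed on the execution point, so "transfer_executable = false" only matters if you keep executable = /usr/bin/mpirun:

# wrapper is transferred with the job by default; mpirun stays on the execution point
universe                = vanilla
executable              = mpi_wrapper.sh
request_cpus            = 4
request_GPUs            = 4
request_memory          = 1024
should_transfer_files   = yes
when_to_transfer_output = on_exit
transfer_input_files    = ./cmake_tmp/bin/main
log                     = logs/job_$(Cluster).$(Process).log
output                  = logs/job_$(Cluster).$(Process).out
error                   = logs/job_$(Cluster).$(Process).error
queue
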
Let us know how it goes!

Jason Patton

On Fri, Sep 16, 2022 at 12:12 AM Zhen Song <snoopy1007@xxxxxxxxxxx> wrote:
Dear friends,

Issue:
I have 3 clusters with Open MPI already installed. The compiled MPI code works fine locally on each cluster. However, when I tried to use condor_submit, I got the following error:

--------------------------------------------------------------------------
The value of the MCA parameter "plm_rsh_agent" was set to a path
that could not be found:

  plm_rsh_agent: ssh : rsh

Please either unset the parameter, or check that the path is correct
--------------------------------------------------------------------------

The submit file is as follows, which I edited based on https://htcondor.readthedocs.io/en/latest/users-manual/parallel-applications.html#mpi-applications-within-htcondor-s-vanilla-universe :

universe  = vanilla
executable = /usr/bin/mpirun
requestMemory = 1024
request_GPUs = 4
request_cpus = 4
arguments = -np 4 ./cmake_tmp/bin/main 0 1 500
log     = logs/job_$(Cluster).$(Process).log
output  = logs/job_$(Cluster).$(Process).out
error   = logs/job_$(Cluster).$(Process).error
should_transfer_files = yes
when_to_transfer_output = on_exit
transfer_input_files = ./cmake_tmp/bin/main
queue

Attempts:
I tried "arguments = -mca plm_rsh_agent /usr/lib/condor/libexec/condor_ssh -np 4 ./cmake_tmp/bin/main 0 1 500". I got errors:

[huashan:596121] [[50260,0],0] ORTE_ERROR_LOG: Not found in file plm_rsh_module.c at line 231
[huashan:596121] [[50260,0],0] ORTE_ERROR_LOG: Not found in file ess_hnp_module.c at line 528
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

 orte_plm_init failed
 --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------

Could anyone please help me sort it out?

Many thanks!

Best,
Max