
[HTCondor-users] single machine (slot) vanilla universe mpirun issue



Dear friends,

Issue:
I have 3 clusters with Open MPI already installed. The compiled MPI code works fine when run locally on each cluster. However, when I try to submit it with condor_submit, I get the following error:

--------------------------------------------------------------------------
The value of the MCA parameter "plm_rsh_agent" was set to a path
that could not be found:

  plm_rsh_agent: ssh : rsh

Please either unset the parameter, or check that the path is correct
--------------------------------------------------------------------------

The submit file is as follows; I adapted it from https://htcondor.readthedocs.io/en/latest/users-manual/parallel-applications.html#mpi-applications-within-htcondor-s-vanilla-universe :

universe   = vanilla
executable = /usr/bin/mpirun
request_memory = 1024
request_GPUs = 4
request_cpus = 4
arguments = -np 4 ./cmake_tmp/bin/main 0 1 500
log        = logs/job_$(Cluster).$(Process).log
output     = logs/job_$(Cluster).$(Process).out
error      = logs/job_$(Cluster).$(Process).error
should_transfer_files = yes
when_to_transfer_output = on_exit
transfer_input_files = ./cmake_tmp/bin/main
queue
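
One thing I considered, since vanilla-universe jobs start with a stripped-down environment, is that Open MPI may simply be unable to find ssh on the job's PATH. As a sketch (untested; the PATH entries and the mpirun location are assumptions for my machines), I drafted a wrapper to use as the executable instead of calling mpirun directly:

#!/bin/bash
# mpi_wrapper.sh - untested sketch of a wrapper executable.
# Assumption: ssh lives under /usr/bin on the execute machine, and the
# job's minimal PATH is why "plm_rsh_agent: ssh : rsh" cannot be found.
export PATH=/usr/bin:/bin:$PATH
# -np 4 matches request_cpus; "$@" forwards the submit-file arguments.
exec /usr/bin/mpirun -np 4 "$@"

I would then set executable = mpi_wrapper.sh and arguments = ./main 0 1 500, since transfer_input_files places the transferred binary in the job's scratch directory as ./main.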

Attempts: 
I tried "arguments = -mca plm_rsh_agent /usr/lib/condor/libexec/condor_ssh -np 4 ./cmake_tmp/bin/main 0 1 500", but got these errors:

[huashan:596121] [[50260,0],0] ORTE_ERROR_LOG: Not found in file plm_rsh_module.c at line 231
[huashan:596121] [[50260,0],0] ORTE_ERROR_LOG: Not found in file ess_hnp_module.c at line 528
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_plm_init failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
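
Since the job runs on a single machine/slot (all four ranks are local), I suspect mpirun never really needs an rsh agent here, so another variant I have drafted (untested; it assumes /bin/false exists on the execute nodes and that a purely local launch never actually invokes the agent) is:

arguments = -mca plm_rsh_agent /bin/false -np 4 ./cmake_tmp/bin/main 0 1 500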

Could anyone please help me sort it out?

Many thanks!

Best,
Max