
Re: [HTCondor-users] parallel job fails to execute



OK, right. Once I used the full path it was fine, and the environment variables were seen by condor with getenv = true.
The remaining issue is that the program is multithreaded on a single machine. So, if the user specifies 4 cores in the program's input scripts, the job should run on a machine with 4 or more free cores.

For example, when I run it directly on one node, I see one process with more than 300% CPU utilization, which means it is fine!

[mahmood@rocks7 Downloads]$ top -b -n 1 | head -n 10 | tail -n 5

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
10955 mahmood   20   0 5799152 1.056g   6864 R 355.2  6.7   6:03.32 l906.exe
 1523 root      20   0       0      0      0 S  41.4  0.0   0:04.76 nfsd
 2612 mahmood   20   0 2049516 255544  77544 S  31.0  1.6  13:11.78 gnome-shell





Now, in the condor submit file, I specified machine_count = 4. However, I see four processes spread across the execute machines, which seems to be wrong. At least, it is not exactly the same as the non-condor run. So I don't know whether it is running one instance of the program with one user input file, or four parallel runs of the program with the same input file.
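
For reference, the relevant part of my submit description looks roughly
like this (a simplified sketch; the executable path is only an example,
not the exact file I use):

--- test.sub (sketch) ---
universe      = parallel
# full path to the executable; this path is just an example
executable    = /usr/local/g09/g09
arguments     = test.gjf
machine_count = 4
getenv        = true
queue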

[mahmood@rocks7 Downloads]$ rocks run host compute-0-0 "top -b -n 1 | head -n 10 | tail -n 5"
Warning: untrusted X11 forwarding setup failed: xauth key data not generated

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 2262 mahmood   20   0 5800192 3.455g   6900 R 212.5 17.9  32:37.43 l906.exe
 2261 mahmood   20   0 5800196 3.435g   6900 R 175.0 17.8  32:40.87 l906.exe
 3672 mahmood   20   0  157596   4200   3616 R   6.2  0.0   0:00.01 top
[1]+  Exit 1                  g09 test.gjf
[mahmood@rocks7 Downloads]$ rocks run host compute-0-1 "top -b -n 1 | head -n 10 | tail -n 5"
Warning: untrusted X11 forwarding setup failed: xauth key data not generated

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
 3102 mahmood   20   0  157596   4216   3660 R  6.7  0.0   0:00.01 top
    1 root      20   0   51636   5188   3800 S  0.0  0.0   0:01.86 systemd
    2 root      20   0       0      0      0 S  0.0  0.0   0:00.00 kthreadd
[mahmood@rocks7 Downloads]$ rocks run host compute-0-2 "top -b -n 1 | head -n 10 | tail -n 5"
Warning: untrusted X11 forwarding setup failed: xauth key data not generated

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 3399 mahmood   20   0 5813944 922472  10644 R 100.0 22.8  16:33.26 l502.exe
 3400 mahmood   20   0 5813944 918044  10308 R 100.0 22.7  16:33.73 l502.exe
    1 root      20   0   43444   5168   3800 S   0.0  0.1   0:02.49 systemd
[mahmood@rocks7 Downloads]$



If you look at compute-0-0, two processes with about 200% utilization each are running.
I should add that compute-0-0 has four free cores, so I expected condor to put the job on compute-0-0 only. But it didn't do that!

Any idea?


Regards,
Mahmood




On Tuesday, February 20, 2018, 9:50:24 AM EST, Jason Patton <jpatton@xxxxxxxxxxx> wrote:


condor_submit does not inspect the submit file to see if getenv is or
is not defined, but it does check that the executable exists, and it
does not use PATH to find the executable. Only the specific path you
give is checked for the executable, treated as relative to the current
directory unless it is absolute. So if you put "executable = g09",
condor_submit looks for g09 in the current directory. If it's not
there, then you get the error you mentioned in your first email.
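
For example (the paths here are only illustrations):

--- submit file snippet ---
# relative path: condor_submit checks for ./g09 in the current directory
executable = g09
# absolute path: condor_submit checks exactly this file
# executable = /usr/local/g09/g09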

There are many fundamental differences between HTCondor and other
batch scheduling systems, as condor was built with distributed
computing in mind. One of these differences is that condor does not
assume anything about the environment of the execute machine. (Why?
For example, in some condor pools, there may not be a shared file
system. Users may not have a home directory on the execute machines.)
When condor starts a job on an execute machine, the job starter
process creates the environment for the job based only on the
description in the submit file. The job starter process then runs the
executable, which is either (1) condor_exec.exe, if the executable was
transferred, or (2) the exact file path given in the submit file, if
the executable was not transferred. The job environment's PATH is not
consulted by the job starter when running the executable.

The executable itself can make use of environment variables.
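
For example, instead of getenv you can pass just the variables the job
needs in the submit file (the names and values below are only
placeholders):

--- submit file snippet ---
environment = "g09root=/usr/local GAUSS_SCRDIR=/tmp"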

If you want to make sure your ~/.bashrc is sourced, you can use a
wrapper script (which can also use environment variables!):

--- wrapper.sh ---
#!/usr/bin/env bash
source ~/.bashrc
g09 "$@" # run g09 with all the arguments passed to the wrapper script

Then use wrapper.sh as the executable in your submit file. However,
if your .bashrc is the same on the execute nodes as it is on your
submit node, using getenv = true should be good enough.
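
A minimal sketch of such a submit file (the names and settings here
are illustrative, not taken from your setup):

--- job.sub (sketch) ---
# ship the wrapper to the execute machine and run it there
executable          = wrapper.sh
arguments           = test.gjf
transfer_executable = true
queue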

Jason