
Re: [HTCondor-users] multicore and multinode run



If you want to get output from multiple nodes of a parallel universe
job, you'll need to include the $(Node) macro as part of your
output/error file names. There are some examples in the manual:
http://research.cs.wisc.edu/htcondor/manual/current/2_9Parallel_Applications.html
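
As a minimal sketch, assuming the rest of your submit file stays as you
posted it, only the output and error lines change. The ".$(Node)" suffix
is just one naming choice; any file name containing the macro should work:

```
universe = parallel
executable = openmpiscript
arguments = mpihello
log = hellompi.log
# $(Node) expands to the node number (0, 1, ...), so each node
# writes to its own file instead of sharing a single one
output = hellompi.out.$(Node)
error = hellompi.err.$(Node)
machine_count = 2
queue
```

With machine_count = 2 this should leave you with hellompi.out.0 and
hellompi.out.1, plus the matching .err files.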

I really recommend reading that page in the manual thoroughly. It
addresses a few use cases (e.g. making sure the entire job isn't taken
down by Node 0 exiting early, requesting multiple cpu cores) that may
be relevant for future jobs.
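
A rough sketch of what those two cases can look like in a submit file.
The values are illustrative assumptions, not taken from your setup, and
how openmpiscript uses the extra cores depends on your HTCondor version,
so check the manual page before relying on this:

```
# Illustrative fragment, to be merged into an existing parallel
# universe submit file
machine_count = 2
# Ask for 2 cpu cores on each node instead of the default of 1
request_cpus = 2
# Keep the job running until all nodes have exited, rather than
# ending it as soon as Node 0 exits
+ParallelShutdownPolicy = "WAIT_FOR_ALL"
```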

However, with Open MPI jobs, all non-error/debug output should be
directed to node 0, which is the only node on which mpirun is
executed. The output you sent looks good and matches your submit
file: machine_count = 2, so you only get output from two nodes (with
one cpu core each by default). Those nodes may be two slots on the
same machine, which seems to be what happened in your case (your job
landed on two slots within compute-0-1, so compute-0-0 was never
part of the job).

Jason

On Fri, Jan 26, 2018 at 2:28 AM, Mahmood Naderan <nt_mahmood@xxxxxxxxx> wrote:
> In the out file, I just see the output from compute-0-1. Why didn't
> compute-0-0 respond?
>
> [mahmood@rocks7 ~]$ cat hellompi.err
> mkdir: cannot create directory '/var/opt/condor/execute/dir_26657/tmp': File
> exists
> mkdir: cannot create directory '/var/opt/condor/execute/dir_26656/tmp': File
> exists
> [mahmood@rocks7 ~]$ cat hellompi.out
> Hello world from processor compute-0-1.local, rank 0 out of 2 processors
> Hello world from processor compute-0-1.local, rank 1 out of 2 processors
> [mahmood@rocks7 ~]$ cat mpi.ht
> universe = parallel
> executable = openmpiscript
> arguments = mpihello
> log = hellompi.log
> output = hellompi.out
> error = hellompi.err
> machine_count = 2
> queue
> [mahmood@rocks7 ~]$ condor_status -af:h Machine DedicatedScheduler
> Machine           DedicatedScheduler
> compute-0-0.local DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx
> compute-0-0.local DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx
> compute-0-1.local DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx
> compute-0-1.local DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx
> compute-0-1.local DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx
> compute-0-1.local DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx
> [mahmood@rocks7 ~]$ ssh compute-0-0 'grep MOUNT_UNDER_SCRATCH
> /opt/condor/etc/condor_config.local'
> Warning: untrusted X11 forwarding setup failed: xauth key data not generated
> MOUNT_UNDER_SCRATCH=/tmp
> [mahmood@rocks7 ~]$
>
>
>
>
> Regards,
> Mahmood
>
>
> On Thursday, January 25, 2018, 2:18:31 PM EST, Jason Patton
> <jpatton@xxxxxxxxxxx> wrote:
>
>
> The mkdir error is an annoyance/bug and shouldn't have any effect on
> the rest of the script. (This annoyance is fixed in 8.7.4+.) Did
> you get the output you were expecting?
>
> Jason
>