
Re: [HTCondor-users] Cannot use multiple arguments to run MPI application in "parallel" universe



Thank you for the detailed explanation, Jason!

hufh

On Sat, Nov 17, 2018 at 12:10 AM Jason Patton <jpatton@xxxxxxxxxxx> wrote:
Your confusion is warranted; we haven't always been great at naming things consistently in HTCondor. :)

From the perspective of the condor daemons (the Schedd, Startd, Collector, etc.), a machine is what you said, an individual computer. The ClassAd attribute "Machine" will (usually) be the hostname of the local machine.

From the perspective of the condor command line utilities, like "condor_q" and "condor_status", a machine is a node is a slot... they're all the same thing, a slot in the condor pool. Similarly, with a parallel universe job, when you request "machine_count = 2" you're telling the dedicated scheduler to schedule your job on two slots from the condor pool.
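
For illustration, you can see the slot-vs-machine naming for yourself with condor_status's autoformat mode (the slot and host names below are just placeholders):

---
$ condor_status -autoformat Name Machine
slot1@host1.example.com host1.example.com
slot2@host1.example.com host1.example.com
slot1@host2.example.com host2.example.com
---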

If you're using partitionable slots, though, the number of "machines" (slots) that condor_q or condor_status reports may not actually match the number of slots that could exist in your pool. If you have one partitionable slot that "owns" 20 cores and nothing is running in your pool, condor_status will report 1 idle machine (slot), but once you submit 20 single core jobs, that slot will get partitioned into 20 slots and condor_status will report 20 claimed machines (slots).
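
For reference, a single partitionable slot that owns all of a machine's cores is typically set up with a startd configuration along these lines (a minimal sketch, not a complete config):

---
# condor_config on the execute machine
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=100%
SLOT_TYPE_1_PARTITIONABLE = TRUE
---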

Jason

On Fri, Nov 16, 2018 at 9:55 AM hufh <hufh2004@xxxxxxxxx> wrote:
Jason,

Now I can run your MPI program with correct output. Thank you so much!

I am a little bit confused by the concept "machine". In this presentation, https://meetings.internet2.edu/media/medialibrary/2015/10/19/20151008-thain-htcondor-admin-tutorial.pdf
it says: "Machine – An individual computer, managed by one startd", which suggests a "machine" is a physical machine.

But when I run condor_q on my 24-core server (it's the only server I have), I got the following result:
             Machines Owner Claimed Unclaimed Matched Preempting Drain
X86_64/LINUX       24     0       4        20       0          0     0
       Total       24     0       4        20       0          0     0
Here "machines" is 24, it means it's not a "physical" machine, but a core or a slot.

Could you please clarify this for me? Also, what does "node" mean? My condor version is 8.6.12 on CentOS.

hufh

On Fri, Nov 16, 2018 at 12:41 AM Jason Patton <jpatton@xxxxxxxxxxx> wrote:
On Thu, Nov 15, 2018 at 10:31 AM hufh <hufh2004@xxxxxxxxx> wrote:
Hi Jason,

Is "a.out" in your script a MPI program?

Yes. It has to be referenced in both the submit file (to be transferred to the execute node) and the wrapper script (to be exec'd).

Here's my code for reference:

---
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>  // for sleep()

int main(int argc, char** argv) {
  MPI_Init(NULL, NULL);

  // number of processes
  int world_size;
  MPI_Comm_size(MPI_COMM_WORLD, &world_size);

  // rank of this process
  int world_rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

  // name of this processor
  char processor_name[MPI_MAX_PROCESSOR_NAME];
  int name_len;
  MPI_Get_processor_name(processor_name, &name_len);

  // print hello world message
  printf("Hello world from processor %s, rank %d out of %d processors\n",
         processor_name, world_rank, world_size);

  // print arguments, one on each line
  for (int i = 1; i < argc; ++i) {
    printf("I was given argument %s\n", argv[i]);
  }

  sleep(5);

  MPI_Finalize();
  return 0;
}
---
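
In case you want to build it yourself, something like this should work, assuming an MPI toolchain such as Open MPI is installed (the source file name is just an example):

---
$ mpicc -o a.out mpi_hello.c
---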

Jason

hufh

On Thu, Nov 15, 2018 at 11:03 PM Jason Patton <jpatton@xxxxxxxxxxx> wrote:
Here's my submit file:

---
universe = parallel

executable = openmpiscript
arguments = mpi_wrapper.sh
transfer_input_files = a.out, mpi_wrapper.sh
getenv = true

should_transfer_files = yes
when_to_transfer_output = on_exit_or_evict
+ParallelShutdownPolicy = "WAIT_FOR_ALL"

output = out.$(NODE)
error = err.$(NODE)
log  = log

request_cpus = 1
machine_count = 4

queue
---
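
You'd submit it the usual way (the submit file name here is just an example):

---
$ condor_submit mpi.sub
---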

Here's mpi_wrapper.sh:

---
#!/bin/sh

if [ "$_CONDOR_PROCNO" -lt 2 ]; then
  exec ./a.out '_CONDOR_PROCNO='$_CONDOR_PROCNO args1
else
  exec ./a.out '_CONDOR_PROCNO='$_CONDOR_PROCNO args2
fi
---

I'm using $_CONDOR_PROCNO to figure out which node of the MPI job the script is running on, and passing arguments to my MPI application (a.out) based on its value.
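
Concretely, with machine_count = 4, the wrapper above ends up exec'ing:

---
node 0: ./a.out _CONDOR_PROCNO=0 args1
node 1: ./a.out _CONDOR_PROCNO=1 args1
node 2: ./a.out _CONDOR_PROCNO=2 args2
node 3: ./a.out _CONDOR_PROCNO=3 args2
---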

Jason



On Thu, Nov 15, 2018 at 6:12 AM hufh <hufh2004@xxxxxxxxx> wrote:
Hi Jason,

Sorry for the late reply. I have tried your method, but it didn't work. Could you please send me your submit file and the other files so that I can try it on my machines?

Thanks for your help!

hufh
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/