
Re: [HTCondor-users] [HTCondor-Users] Parallel job MPICH implementation RANK.



Vikrant,

One way to have parallel universe jobs fill slots on machines in depth-first order is to have each machine advertise a number (one unique number per machine) in an attribute of its startd ClassAd, and to use a rank expression in the submit file that targets that attribute.


For example, your execute machines could have...

condor_config.local on machineA:
PARALLEL_RANK = 100
STARTD_ATTRS = $(STARTD_ATTRS) PARALLEL_RANK

condor_config.local on machineB:
PARALLEL_RANK = 99
STARTD_ATTRS = $(STARTD_ATTRS) PARALLEL_RANK

condor_config.local on machineC:
PARALLEL_RANK = 98
STARTD_ATTRS = $(STARTD_ATTRS) PARALLEL_RANK


and then your parallel universe job's submit file could have...

rank = TARGET.PARALLEL_RANK


The dedicated scheduler will try to match your job to slots where the rank expression is highest first, so machineA would have its slots filled first, then machineB, then machineC, and so on.
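
The ordering described above can be sketched with a toy model in Python. This is only an illustration of "highest rank first", not the dedicated scheduler's actual algorithm; the four idle slots per machine and the pick_slots helper are assumptions made for the example.

```python
# Toy model only: NOT HTCondor's actual dedicated-scheduler algorithm.
from collections import Counter

# PARALLEL_RANK values from the example config; 4 idle slots per machine
# is an assumption made for this sketch.
parallel_rank = {"machineA": 100, "machineB": 99, "machineC": 98}
slots_per_machine = 4


def pick_slots(machine_count):
    """Return the machine names hosting the first machine_count slots by rank."""
    slots = [
        (rank, name)
        for name, rank in parallel_rank.items()
        for _ in range(slots_per_machine)
    ]
    # Highest rank first, so one machine's slots are used up before the next.
    slots.sort(key=lambda s: s[0], reverse=True)
    return [name for _, name in slots[:machine_count]]


placement = Counter(pick_slots(6))
print(dict(placement))  # {'machineA': 4, 'machineB': 2}
```

In this model a machine_count = 6 job takes all four slots on machineA before touching machineB, which is the depth-first behaviour the rank expression is meant to encourage.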


Jason

On Thu, Aug 6, 2020 at 12:06 AM <ervikrant06@xxxxxxxxx> wrote:
Hi Jason, 

Thanks for your response.

The problem is that, with machine_count, HTCondor seems to fill the pool breadth-first rather than depth-first. To fill depth-first, we reduce machine_count and increase request_cpus, but that affects the RANK count.

I am looking for a way to fill the pool depth-first without affecting RANK.


Thanks & Regards,
Vikrant Aggarwal


On Tue, Aug 4, 2020 at 8:45 PM Jason Patton <jpatton@xxxxxxxxxxx> wrote:
Hi Vikrant,

Despite its name, "machine_count" does not necessarily have to do with the number of physical/virtual machines that condor will schedule your job on. "machine_count" tells condor the total number of *slots* that the job should occupy. Suppose you have a job with machine_count = 4... if you have 4 open slots on a single machine in your condor pool, your entire "machine_count = 4" job may be scheduled on that single machine. In that case, mp1script will run mpirun with 4 CPU ranks, but all the ranks will be on a single machine.
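
To make the slots-versus-machines point concrete, here is a small Python sketch. The free-slot counts and both helper functions are hypothetical; nothing here queries a real pool or reflects how condor itself picks slots.

```python
# Illustrative only: hypothetical free-slot counts, not read from a real pool.
free_slots = {"node1": 4, "node2": 2}


def fits_on_one_machine(machine_count, free_slots):
    """True if a single machine has enough free slots for every MPI rank."""
    return any(n >= machine_count for n in free_slots.values())


def min_hosts_needed(machine_count, free_slots):
    """Fewest machines whose free slots cover machine_count, or None."""
    remaining = machine_count
    hosts = 0
    # Use the machines with the most free slots first.
    for n in sorted(free_slots.values(), reverse=True):
        if remaining <= 0:
            break
        remaining -= n
        hosts += 1
    return hosts if remaining <= 0 else None


print(fits_on_one_machine(4, free_slots))  # True
print(min_hosts_needed(6, free_slots))     # 2
```

With four free slots on node1, a machine_count = 4 job can land entirely on that one host, while a machine_count = 6 job needs both hosts.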

(The name "machine_count" is a bit outdated, going back to the days where there was usually only one CPU core per machine in a typical condor pool.)

Hopefully this helps, though I may have misunderstood your question.

Jason Patton

On Tue, Aug 4, 2020 at 3:17 AM <ervikrant06@xxxxxxxxx> wrote:
Hello Experts,

I was not able to find information in the docs that could help me with my queries.

Any input is highly appreciated.

Thanks & Regards,
Vikrant Aggarwal


On Wed, Jul 29, 2020 at 6:58 PM Vikrant Aggarwal <ervikrant06@xxxxxxxxx> wrote:
Hello Experts,

Any thoughts..

Thanks & Regards,
Vikrant Aggarwal


On Mon, Jul 27, 2020 at 4:24 PM Vikrant Aggarwal <ervikrant06@xxxxxxxxx> wrote:
Hello Condor Experts,

We are running parallel jobs in a cloud environment using the MPICH implementation script, mp1script. We want to pack each parallel job onto the minimum number of hosts to reduce cloud costs. We have used machine_count and request_cpus to achieve this, but changing machine_count directly affects the RANK of the jobs. We want to keep the RANK of the jobs at a higher value. TBH, I am not sure what the advantage of that is. Please enlighten me if anyone has information about how RANK is used.

While going through the documentation, I found:

The macro $(Node) is similar to the MPI rank construct

How could we achieve both keeping the MPI jobs on a minimal number of hosts and with higher RANK value?

Regards,
Vikrant Aggarwal
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/