
Re: [HTCondor-users] setting up dedicated pool for parallel universe



Dear Greg,

Thank you for your response. I have changed the IP address to the hostname, and the simple sleep job now seems to work. I am now testing proper MPI jobs with mpirun. I am having some trouble, but I am still digging into it.

Besides, as you rightly guessed, what we want is to run MPI and OpenMP jobs on multiple cores of a single machine. I could not find the proper configuration and job submit script, so I ended up trying the dedicated scheduler. What is the best way to achieve our goal? We do not plan to run MPI jobs across two machines; multiple cores on one machine are enough for us.

Thanks,
Kodanda

On Wed, 26 Dec 2018 at 22:40, Greg Thain <gthain@xxxxxxxxxxx> wrote:
On 12/26/18 10:09 AM, Kodanda Ram Mangipudi wrote:
Hi,

I am a newbie at HTCondor setup and administration, though I was a user in the past.
We are trying to set up a pool of 2 machines, each with dual CPUs with 20 cores each. I made a 6-core i5 machine the master, and the other 2 HPC workstations the nodes. All are running Ubuntu 16.04. The condor installation is from the default Ubuntu repositories, installed using apt-get.


First, note that you only need to set up a dedicated scheduler and submit MPI jobs with the parallel universe if you want to have MPI jobs run concurrently on more than one machine. If you just want your MPI jobs to run on multiple cores on one machine (which is always the fastest kind of interconnect), you can use the vanilla universe.
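
As an illustration of the vanilla-universe approach described above, here is a minimal sketch of a submit description file. The wrapper script name (run_mpi.sh), the file names, and the core and memory counts are hypothetical placeholders, not values from this thread. The wrapper is assumed to launch mpirun with a process count that matches request_cpus.

```
# Sketch only: run one multi-core MPI (or OpenMP) job on a single machine.
# run_mpi.sh is a hypothetical wrapper that calls mpirun with 8 processes.
universe       = vanilla
executable     = run_mpi.sh
request_cpus   = 8
request_memory = 4GB
output         = mpi_job.out
error          = mpi_job.err
log            = mpi_job.log
queue
```

A job that requests more than one core only matches a slot that offers that many, so this assumes the execute nodes advertise multi-core or partitionable slots rather than one single-core slot per core.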




node01
File: /etc/condor/condor_config (Package manager's copy; identical to that of master node)
/etc/condor/config.d/00debconf (Edited for configuration)
-------------------------------------Begin file ----------------------------------

# Added: by system admin: For Dedicated scheduler for parallel universe
DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxx"


On the nodes, you want the part after the @ sign to be the condor name of the schedd, which is probably not the IP address. You can find the name of a schedd by running


condor_status -sched

and the first column will be the condor name of the schedd. Put that after the @ sign in the config file, and restart, and I think things will work better.
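
For illustration, the resulting entry on an execute node might look like the sketch below. The name submit.example.org is a made-up placeholder for whatever the first column of the condor_status output shows. The STARTD_ATTRS line follows the usual dedicated-scheduler setup in the HTCondor manual and is an assumption about this pool, not something taken from the configuration quoted above.

```
# Sketch only: /etc/condor/config.d/00debconf on each execute node.
# Replace submit.example.org (placeholder) with the schedd name reported
# in the first column of condor_status -sched.
DedicatedScheduler = "DedicatedScheduler@submit.example.org"
# Advertise the attribute in the machine ClassAd so the schedd can claim the node.
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
```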


-greg

