
Re: [HTCondor-users] multicore and multinode run



Let's take your example with a condor pool containing:

Machine1 with 1 core
Machine2 with 2 cores

If you submit a parallel universe job with "machine_count = 3" as the
only requirement, condor will try to schedule three slots in the pool
with one core each. This will work **if** your pool is configured to
have one static slot per core (the default) or if you are using
partitionable slots. However, if your pool is configured with a single
static slot on each machine (perhaps with each slot containing all of
the cores), then your job will not match because you will only have
two slots -- one on Machine1 with one core, one on Machine2 with two
cores.
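
As a rough sketch of that first case, a submit file might look something
like the following. The executable and file names are placeholders made up
for illustration, not taken from your setup:

```
# Sketch only: my_parallel_job.sh, out.*, err.* and job.log are placeholders.
universe      = parallel
executable    = my_parallel_job.sh
# Ask for three slots; each slot gets one core unless request_cpus says otherwise.
machine_count = 3
# $(Node) expands to the node number, so each node writes its own files.
output        = out.$(Node)
error         = err.$(Node)
log           = job.log
queue
```

With one static slot per core, the three nodes could land on the single
slot of Machine1 plus the two slots of Machine2.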

It's difficult to say whether specific examples will work without
knowing exactly how your pool is configured.

Based on the condor_status output you've provided in these threads, it
seems that you have one static slot per core. This means that you can
only submit jobs that request a single core per slot (or per
node/machine in the parallel universe), but you can request as many
nodes (machine_count) as you want up to the number of slots in your
pool.

If you want to submit jobs that request more than a single core (e.g.
request_cpus = 2), then you will need to reconfigure your pool to have
more than one core per slot or consider using partitionable slots.
Here's the manual for configuring slots (Section 3.5.10):
https://research.cs.wisc.edu/htcondor/manual/v8.0/3_5Policy_Configuration.html
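
For the partitionable slot route, a minimal sketch of the execute-node
configuration could look like this. It assumes you want a single
partitionable slot owning all of each machine's resources, and that you put
it in whatever local config file your nodes already read:

```
# Sketch only: one partitionable slot that owns 100% of the machine.
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = 100%
SLOT_TYPE_1_PARTITIONABLE = TRUE
```

A submit file could then add a line such as "request_cpus = 2", and a
two-core piece would be carved out of the partitionable slot on a machine
that has at least two cores free (Machine2 in the example above). As far as
I know, slot layout changes need the startd to be restarted before they take
effect, so check the manual section above before rolling this out.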

Jason

On Fri, Jan 26, 2018 at 9:13 AM, Mahmood Naderan <nt_mahmood@xxxxxxxxx> wrote:
> I forgot to say that the example "Requesting multiple cores per slot" in the
> document is good for the first case. But I doubt if it helps with my second
> case in the previous email.
>
>
>
> Just want to be sure about that example in the manual.
>
>
>
> Regards,
>
> Mahmood
>
>
>
> From: Jason Patton
> Sent: Friday, January 26, 2018 6:16 PM
> To: HTCondor-Users Mail List
> Subject: Re: [HTCondor-users] multicore and multinode run
>
>
>
> If you want to get output from multiple nodes of a parallel universe
> job, you'll need to include the $(Node) macro as part of your
> output/error file names. There are some examples in the manual:
> http://research.cs.wisc.edu/htcondor/manual/current/2_9Parallel_Applications.html
>
> I really recommend thoroughly reading that page in the manual; it
> addresses a few use cases (e.g. making sure the entire job isn't taken
> down by Node 0 exiting early, requesting multiple cpu cores) that may
> be relevant for future jobs.
>
> However, with Open MPI jobs, all non-error/debug output should be
> directed to node 0, which is the only node on which mpirun is
> executed. The output you sent looks good and matches your submit
> file... machine_count = 2 so you only get output from two nodes (with
> one cpu core each by default). Those nodes may be two slots on the
> same machine, which seems to be what happened in your case (your job
> landed on two slots within compute-0-1).
>
>
>
> Jason