
Re: [Condor-users] Guiding machine choices for the parallel universe



Alan Woodland wrote:
Hi,

Is there a way to "guide" condor's choice of nodes to satisfy a
parallel universe job automatically? Basically the cost of
communication between all nodes in one of my pools is not equal
because of network topology. Given this I'm looking for a way to make
the Dedicated Scheduler aware of this and prefer to match nodes that
are close to each other on the network, but without preventing larger
parallel jobs from using all the machines.

Clearly users could write a requirements = or rank = line in their
job submission file, but I don't think it's reasonable or fair to
expect users to do this.

If you have a rank or requirements expression that can implement the
policy you want, you can automatically add it to every submit by using
the SUBMIT_EXPRS config knob.
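For example, something like this on the submit machine might do it
(untested sketch; SubPoolId is just a placeholder for whatever custom
machine attribute carries the sub-pool, and check the manual for how
SUBMIT_EXPRS interacts with a rank = line a user puts in their own
submit file):

# condor_config on the submit machine
Rank = (TARGET.SubPoolId =?= 1) * 1000
SUBMIT_EXPRS = $(SUBMIT_EXPRS) Rank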


I was thinking of doing something along the lines of logically
dividing the nodes into n sub-pools (within which connectivity is good
between all the nodes), and giving each sub-pool a number. Then an
expression something like:

NEGOTIATOR_PRE_JOB_RANK = (MY.Universe == PARALLEL) *
((free_nodes_in_my_subpool - MY.Machine_count) * my_subpool_id)

would achieve this, provided the sub-pool IDs were suitably large.
Obviously this isn't syntactically correct just yet!

Actually in practice doing this is slightly harder than I'd hoped.

Firstly, is it true to say that (False * x) == 0 and (True * x) == x?

Secondly, how would I go about writing an expression that maps machine
names to some (pre-defined) sub-pool IDs? Or am I better off putting
that as a custom attribute in the startd ads?

I think the latter is an easier and cleaner approach.
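For the startd side, the config on each execute node could be as
simple as this (SubPoolId is again just an illustrative name):

# condor_config on every execute node in, say, sub-pool 3
SubPoolId = 3
STARTD_ATTRS = $(STARTD_ATTRS) SubPoolId

After a condor_reconfig the attribute shows up in the machine ad, and
rank or requirements expressions can then refer to it (as MY.SubPoolId
or TARGET.SubPoolId, depending on which ad is doing the evaluating).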


Thirdly, is MachineCount an attribute in parallel universe job
ClassAds? I can't see it listed in
http://www.cs.wisc.edu/condor/manual/v7.0/Appendix_A_ClassAd.html

MinHosts is what you want.
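In NEGOTIATOR_PRE_JOB_RANK the machine ad is MY and the job ad is
TARGET, so your sketch would come out something like this (purely
illustrative and untested; SubPoolId is the custom startd attribute
suggested above, and SubPoolFree stands in for the
free_nodes_in_my_subpool count from your next question):

# parallel universe jobs have JobUniverse == 11
NEGOTIATOR_PRE_JOB_RANK = (TARGET.JobUniverse == 11) * \
    ((MY.SubPoolFree - TARGET.MinHosts) * MY.SubPoolId)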

Fourthly, how could free_nodes_in_my_subpool be implemented?

Or generally is there a nicer way to solve this without topological
changes to the network or intervention from each user of the parallel
universe?

There's also the ParallelSchedulingGroups feature, which may help.

http://www.cs.wisc.edu/condor/manual/v7.1/3_12Setting_Up.html#SECTION004128400000000000000
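Roughly (the details are in that section): you tag each execute node
with a group name in its config, e.g.

# condor_config on the nodes behind one switch
ParallelSchedulingGroup = "rack-a"
STARTD_ATTRS = $(STARTD_ATTRS) ParallelSchedulingGroup

and the dedicated scheduler will then try to satisfy a parallel job
entirely within a single group.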

-Greg


Thanks,
Alan