
[Condor-users] Multi-Slot settings for single nodes



Hi,

I have briefly looked through the archive, but have not found this
topic. If I have missed it, please give me a pointer to the relevant
thread.

Consider the following situation:

I have a few boxes with newly arrived quad-core CPUs. The usual set-up
would be to define 4 slots for each machine, with each slot getting 25%
of the installed memory (say 2 GByte each).

However, from time to time a user comes to me and says they need to
run a program which requires 5-6 GByte of memory. As far as I know,
this could be achieved by

(a) the user restricts the job so that only slot1 of each node can be
used (I have done that with the vm setting; a sketch of both options
follows after (b)). However, with this set-up you have to rely on the
other jobs running on that node needing less than 2-3 GByte in total,
otherwise the system will run heavily into swap.

(b) the admin of the cluster temporarily changes the set-up so that
there is only a single slot available on particular nodes for some
time.
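
To make (a) and (b) concrete, they look roughly like this on my set-up
(attribute and macro names differ between Condor versions, older ones
still speak of virtual machines, so take this only as a sketch):

    # (a) submit-file fragment: the user pins the job to slot 1 of a node
    #     (older versions call the attribute VirtualMachineID instead of SlotID)
    requirements = ( SlotID == 1 )

    # (b) temporary change in the condor_config of selected nodes:
    #     advertise only a single big slot per machine
    NUM_SLOTS = 1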

I guess most people would agree with me that both of these are
suboptimal.

My question:

Would it be possible (maybe as a feature wish for a new version) to
allow the scheduler to have knowledge of both slots and nodes?

E.g.:

My cluster has 100 quad-core nodes with the above specification. My
possibilities are:

total number of slots | slots per node | memory per slot [GByte]
                  100 |              1 |                        8
                  200 |              2 |                        4
                  400 |              4 |                        2
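
For illustration, the last row (4 slots per node, 2 GByte each) would
correspond to something like the following condor_config fragment
(slot-type syntax; only a sketch):

    # 4 identical slots per node, each with one core and a quarter of the memory
    SLOT_TYPE_1      = cpus=1, memory=25%
    NUM_SLOTS_TYPE_1 = 4

    # the first row would instead be one big slot with all cores and all memory:
    # SLOT_TYPE_1      = cpus=4, memory=100%
    # NUM_SLOTS_TYPE_1 = 1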

Now imagine user A submitting a cluster of 4000 jobs which all need
only 1 GByte (requirement set in the submit file). They run happily for
some time until user B arrives and submits 200 jobs with the above
requirement of 6 GByte per slot.
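
Just for illustration, user B's jobs would carry a memory requirement
roughly like this in the submit file (Memory is advertised in MByte;
the executable name is only a placeholder):

    # sketch of user B's submit description file
    executable   = big_job
    requirements = ( Memory >= 6144 )
    queue 200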

If the admin now changes a few nodes, say 30, to a single-slot set-up,
both users' jobs can run happily (given all other requirements are
fulfilled). However, the admin has to watch this and might need to
change it back later to improve the throughput.

My wish for Condor would be that, if user B is allowed to run
(userprio et al.), it starts to free up a few nodes and runs only user
B's jobs there. This could be done forcefully by eviction, or simply by
waiting until the running jobs are finished and not starting new jobs
on those nodes.
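
The closest manual approximation I know of today would be something
like the following in the local config of the chosen nodes, followed by
a condor_reconfig (a sketch only; my wish is that Condor would do this
on its own):

    # stop accepting new jobs so the node drains as running jobs finish
    START = False
    # ... and later accept only user B's jobs (Owner is a job ClassAd
    # attribute; "userB" is a placeholder)
    # START = ( Owner == "userB" )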

If you are a Condor developer and are crying in agony right now because
this undermines the standard rules of node matching, I am very sorry.
But otherwise, I really wish this were possible.

Sorry for the lengthy mail, but I hope I have made my problem clear.

Cheers

Carsten