
Re: [Condor-users] FlexLM Licence + Group Quota + Preemption





On Sun, Sep 25, 2011 at 11:07 PM, Sassy Natan <sassyn@xxxxxxxxx> wrote:
Hi List,

I'm kind of lost here, so maybe someone can give me some guidelines.

I have limited experience implementing preemption under Condor, so here goes:

1. My Condor cluster includes 96 slots (6 machines, 16 cores each, 24 GB of RAM).
2. My users all belong to the same UID_DOMAIN.
3. Condor is working great!
4. I have up to 13 floating licences for Matlab on my FlexLM server.
5. I have implemented Group Quota for two groups:

  • Group X - limit of 13 slots
  • Group X.A - limit of 7 slots
  • Group X.B - limit of 6 slots
GROUP_AUTOREGROUP_X.A = True
GROUP_ACCEPT_SURPLUS_X.A = True
GROUP_AUTOREGROUP_X.B = True
GROUP_ACCEPT_SURPLUS_X.B = True
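
For reference, a minimal sketch of how the quotas in item 5 are typically written in the negotiator configuration. GROUP_NAMES and GROUP_QUOTA_<group> are the standard knobs. The exact group names, and the assumption that the parent group X does not accept surplus (which is what would cap both subgroups at 13 slots together), are guesses at the setup described above, not the actual file:

```
# Sketch only: group names are taken from the list above.
GROUP_NAMES = X, X.A, X.B

# Static quotas, counted in slots.
GROUP_QUOTA_X   = 13
GROUP_QUOTA_X.A = 7
GROUP_QUOTA_X.B = 6

# Assumption: the parent does not take surplus from the rest of the pool,
# so X.A and X.B together never exceed 13 slots (the Matlab licence count).
GROUP_ACCEPT_SURPLUS_X = False
```

Jobs would then be tagged in the submit file with something like +AccountingGroup = "X.A.username" (the user name here is a placeholder).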

6. So when a user from the X.A group submits 20 jobs to the queue, and no jobs from users in the X.B group are currently running, only 13 slots can be used by the X.A group. The other 7 jobs go idle, waiting for one of the running jobs to complete.
7. Once any of the 13 jobs completes, the jobs waiting in the queue start running, until the queue is empty.
8. So far all is good :-) 
9. However, if a user from the X.B group submits jobs that need 3 slots, those jobs will get higher priority in the queue than the jobs from the X.A group.
10. Now, if I understand correctly, leaving preemption disabled should ensure that once any of the currently running jobs from the X.A group finishes, jobs from the X.B group will start.
11. So to summarize, here is what we have so far:

User X.A submitted - 20 jobs to the pool: 13 slots are allocated, 7 waiting in the queue.
................................
10 min later
................................
                              Queue status: 13 slots are allocated, 3 jobs waiting in the queue. (4 jobs completed)
--------------------------------
10 min later
--------------------------------
Users X.B submitted - 3 jobs to the pool.
                              Queue status: 13 slots are allocated, 6 jobs waiting in the queue. (3 from user X.A and 3 from users X.B)
--------------------------------
Once any of the running jobs is completed, jobs from the users X.B should start running.

12. Now my problem: I would like to implement preemption, since my users are complaining about FlexLM timeouts. In other words, the following scenario happens with my current configuration, which is not ideal:

When a user from the X.B group submits a job, he would like it to start right away. He doesn't want to wait until one of the jobs from user X.A finishes (which can take around 2 hours).
I do, however, want to keep GROUP_ACCEPT_SURPLUS on, since it improves system performance.
I don't mind if jobs from X.A are killed once jobs from X.B get into the queue. Checkpointing is a solution I still need to look into, but I want to make sure that once users from X.B send jobs to the queue, those jobs start running, up to their quota limit (in my case 6 slots for the X.B group).
The same thing should work for the X.A group.
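
One direction that might give this behaviour, sketched under assumptions: the negotiator's PREEMPTION_REQUIREMENTS expression can compare group usage against group quota, so that a group still under its quota may preempt a job whose group is running over quota on surplus. The attribute names below are the ones the Condor manual lists for PREEMPTION_REQUIREMENTS. Whether they are available depends on the Condor version, and this has not been tested against the pool described here:

```
# Allow preemption only when the submitting group is below its quota
# and the running job's group is above its quota (i.e. using surplus).
PREEMPTION_REQUIREMENTS = (SubmitterGroupResourcesInUse < SubmitterGroupQuota) \
                          && (RemoteGroupResourcesInUse > RemoteGroupQuota)
```

A preempted job is evicted and, without checkpointing, would restart from the beginning when it is matched again.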

How can this be done under condor?

Another thought: maybe it would be a good idea to define some kind of job category option, so that jobs with this category defined can kick other jobs out of the pool.
This is all due to the limit of my Matlab licences.

Any ideas?

Thanks 
Sassy