
Re: [Condor-users] Condor with VLSI Tools



Hi Sassy,

My background is in EDA and I started out building Condor-based pools for FPGA development. I feel your pain, I really do.

The bad news: there really isn't one solution out there, paid or free, that fits well with modern VLSI development practices, largely because design methodologies vary widely from company to company, even from group to group. But let's see what we can do to get you to a happier place...

On Tuesday, 6 December, 2011 at 2:08 PM, Sassy Natan wrote:

I would like to share with you some of my problems, so maybe someone has an idea of how to achieve what I need:

1. The cluster includes 20 machines with 24 cores each, so a total of 480 cores.
2. Each machine has 24GB of RAM.
3. All machines are connected to a NetApp File Server over NFS.
4. All machines are running RHEL 6.0 and belong to the same UID domain.

Now,
My users would like to have the cluster manage their jobs as follows:

They would like to have two kinds of jobs:
1. Jobs that run right away when submitted
2. Jobs that run in certain scenarios (more below)
Type 1 can be hard to implement without wasting resources. If the system is full of type 2 jobs and a type 1 job comes along, do you preempt? If so, you essentially lose all the time the preempted job has spent on the CPU and holding the license. If you suspend instead, how do you get FlexLM to release the license? Some licenses release themselves, some don't; it's really inconsistent from vendor to vendor.
However, all jobs depend on a FlexLM license (Matlab, Synopsys VCS, etc.).
This is the tricky bit. FlexLM doesn't play nice with *any* job scheduling system. LSF has some loose integration but it's about on par with Condor's resource counters. And of course, none of the technologies do much to force licenses to be released when you suspend jobs (but I have a suggestion for this I'll get to in a bit…).

The good news: you're running on Linux. That's excellent. Without Windows in there to gum up the works, there are some interesting options you can look at to help maximize your license use. Note: I'm assuming tool use is the ultimate goal for your group, and that your tool costs are the most expensive resource in your pool by several orders of magnitude (I know at my last employer we paid several million dollars a year for tools; the machines we ran them on were, in comparison, essentially free next to the tool licenses).
So say I have 100 licenses of Matlab, and I want to share the licenses in a specific way based on the type of job, so I would have the following:


1. When a user submits a job (that can be divided into 400 jobs) I would like to limit his number of parallel jobs (so he will not get all the licenses and will leave some for other users).
Setting concurrency limits is not a good option here, since it is a global definition and not per user. It is true that this can limit the number of parallel jobs (if the limit is reached), but it cannot prevent a single user from grabbing all the available licenses.
So first piece of advice: control submissions. And by that I mean don't let your users access condor_submit directly. All submissions need to go through a proxy. Once you've got a portal, you've got a way to enforce rules on your submissions so you can implement policies that can't be circumvented. Obviously I'm biased and will recommend CycleServer, but you could just as well write your own command-line submission interface in a language of your choice (there's a minimal sketch below). The end goal is controlling what people can put in the queue and enforcing consistency and rules on your submission tickets so people can't game the system.
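For flavour, here's a minimal sketch of what such a wrapper might look like. The script name, the +SubmittedViaProxy marker, and the <tool>_<user> limit-naming scheme are all made up for illustration; a real portal would validate far more than this:

    #!/bin/sh
    # submit-proxy.sh <ticket> <tool>
    # Illustrative only: users call this instead of condor_submit.
    TICKET="$1"
    TOOL="$2"
    ME=`whoami`

    # Refuse tickets that try to set their own license limits.
    if grep -qi '^[[:space:]]*concurrency_limits' "$TICKET"; then
        echo "concurrency_limits is set by the portal, not the user" >&2
        exit 1
    fi

    # Prepend the enforced settings so they're in effect before the
    # ticket's queue statement, then hand off to the real condor_submit.
    TMP=`mktemp` || exit 1
    {
        echo "concurrency_limits = ${TOOL}, ${TOOL}_${ME}"
        echo "+SubmittedViaProxy = True"
        cat "$TICKET"
    } > "$TMP"
    condor_submit "$TMP"
    RC=$?
    rm -f "$TMP"
    exit $RC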
 
As for parallelism: you could look at implementing per-user concurrency. This works if you control the submission interface. You can inject a custom attribute into the submission that references a counter unique to the tool and the user. Now you've got a hard limit on how many jobs a user can run with a given tool, without having to involve DAGMan.
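A rough sketch of what I mean, using Condor's concurrency limits. All the names and numbers are illustrative, and CONCURRENCY_LIMIT_DEFAULT assumes a reasonably recent 7.x release:

    ## condor_config on the central manager
    # Global pool of Matlab seats
    MATLAB_LIMIT = 100
    # Per-user cap, matching the <tool>_<user> names the portal injects
    MATLAB_SASSY_LIMIT = 25
    # Anyone without an explicit cap falls back to this
    CONCURRENCY_LIMIT_DEFAULT = 10

    ## And the line the portal injects into every Matlab ticket:
    concurrency_limits = matlab, matlab_sassy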

You could also look at using quotas here. I realize you've tried this already, but wait for 7.6.5 and I think you'll be able to use the hierarchical quotas you desire. In your case a top-level quota that defines the total number of licenses for a tool, and then sub-quotas for each user or group to limit their ability to completely consume those licenses, makes sense to me.
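When the fix is in, I'd expect the setup to look roughly like this (group names and the quota split are invented for the example):

    ## condor_config on the central manager (sketch)
    GROUP_NAMES = group_matlab, group_matlab.design, group_matlab.verif
    # Top level: total Matlab capacity
    GROUP_QUOTA_group_matlab = 100
    # Sub-groups: no single team can consume everything
    GROUP_QUOTA_group_matlab.design = 60
    GROUP_QUOTA_group_matlab.verif = 40

    ## And in the (portal-generated) submit ticket:
    +AccountingGroup = "group_matlab.design.sassy"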
2. I want some jobs to run only if the FlexLM license has a minimum of 10 free licenses not in use. This will ensure that real-time jobs start as soon as they are submitted because they have a free license. I don't know how to achieve this.
This one is tough. I'll throw out some imperfect suggestions; if you want to brainstorm here on the list, go ahead, but these should give you some ideas for possible solutions. You can have all batch jobs that use this license submit in the held state (remember: you're controlling the submission interface, so you can enforce these kinds of rules now). Then you run a schedd cron job that calls lmstat, looks at the license counts, and releases an appropriate number of held jobs based on what lmstat reports. The obvious downside to this approach is that releasing a job doesn't mean the job is running and consuming the license, so the license counts could change underneath you. But you could make these types of jobs preempt all others, or just live with this slight incongruence in your system.

All of this is best done if you keep all jobs that need to be managed this way on one schedd, BTW. It gets much harder to manage as you add more schedd daemons to your pool or spread job types out across schedd daemons.
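Here's a rough sketch of what that cron script might look like. The lmstat field positions and the +LicenseManaged marker attribute are assumptions; check the output of your own license server before trusting the awk:

    #!/bin/sh
    # lm-release.sh: release held jobs when FlexLM shows spare seats,
    # always keeping RESERVE seats free for the run-right-away jobs.
    FEATURE=MATLAB
    RESERVE=10

    LINE=`lmstat -a -c "$LM_LICENSE_FILE" | grep "Users of $FEATURE:"`
    # Typical line: "Users of MATLAB: (Total of 100 licenses issued;
    # Total of 42 licenses in use)" -- verify the field positions on
    # your server before trusting this.
    TOTAL=`echo "$LINE" | awk '{print $6}'`
    USED=`echo "$LINE" | awk '{print $11}'`
    FREE=`expr $TOTAL - $USED - $RESERVE`

    [ "$FREE" -gt 0 ] || exit 0

    # Release up to $FREE jobs the portal tagged with +LicenseManaged.
    # JobStatus == 5 means "held".
    condor_q -constraint 'LicenseManaged =?= True && JobStatus == 5' \
             -format "%d." ClusterId -format "%d\n" ProcId \
        | head -n "$FREE" \
        | xargs -r -n 1 condor_release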

The other option is to have a schedd cron job adjust the quotas or the resource counters instead of releasing held jobs. This plays a little nicer with jobs that don't start right away: you can increase availability, and if nothing starts up, decreasing it again is safe and easy. If you reduce it below the number of running jobs, you can even have your cron job consider evicting jobs to bring the use count back down, or you can let them run to completion. You'll want to make sure whatever algorithm you use here is sufficiently damped so you don't get wild swings in state as availability changes. Small state changes at moderate intervals are what you're after here.
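The counter-adjusting variant can be as small as this sketch. It assumes you've enabled runtime configuration (see ENABLE_RUNTIME_CONFIG and its security implications), and MATLAB_LIMIT is the same illustrative counter from above:

    # Recompute the seat count batch jobs may use (damped!), then push
    # it to the negotiator at runtime -- no restart required.
    NEWLIMIT=42    # in real life: derived from lmstat, then smoothed
    condor_config_val -negotiator -rset "MATLAB_LIMIT = $NEWLIMIT"
    condor_reconfig -negotiator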

And yet one more option is to schedule interactive jobs via Condor. This ensures that the mechanism managing all your license requests is first and foremost the Condor negotiator, and it lets you do better accounting within Condor. You just need to make sure interactive jobs preempt batch jobs quickly enough to satisfy your users. If you're all Linux throughout your office, it's easy enough to wrap up "interactive" jobs so they run on remote nodes (or even desktops) and export their display back to the user's X session before running the tool, or before presenting an xterm from which the user can call up the tool. This is how I did it at my last place of work (more or less): users could have their interactive jobs queue to run on their local machine, or queue to run on a dedicated, interactive-only machine in my pool. It worked pretty well. I don't recommend letting interactive and batch jobs commingle; batch job behaviour usually makes the machines unusable for interactive work.
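A bare-bones sketch of such a wrapped interactive ticket. The InteractiveOnly and IsInteractive attributes are illustrative names you'd advertise and enforce yourself, and getting the display home may need xhost or SSH tunnelling:

    ## interactive.sub: run an xterm on a pool node with the display
    ## exported back to the submitter's X session.
    universe        = vanilla
    executable      = /usr/bin/xterm
    arguments       = -ls
    # Send the display home; assumes pool nodes can reach your X server.
    environment     = DISPLAY=$ENV(DISPLAY)
    # Match only machines set aside for interactive work; advertise
    # InteractiveOnly on those startds via STARTD_ATTRS.
    requirements    = (InteractiveOnly =?= True)
    +IsInteractive  = True
    queue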

In terms of evicting running jobs without losing progress, have you looked at Jaryba's SmartSuspend technology (http://www.jaryba.com/)? It's built for exactly your problem (and your industry, actually). It's a per-program virtualization wrapper: you run your EDA tools under SmartSuspend, it integrates with Condor, and when Condor wants to preempt a job, SmartSuspend checkpoints the process-level virtual machine *and* interacts with FlexLM to ensure the licenses the job was using are released. This gives you nearly lossless preemption: when your preempted job resumes, it picks back up nearly, if not exactly, where it left off, so you don't lose all the forward-progress time. Neat tech that's been around long enough to be well proven.

3. I have tried to use group quotas (see https://www-auth.cs.wisc.edu/lists/condor-users/2011-July/msg00060.shtml, https://lists.cs.wisc.edu/archive/condor-users/2011-November/msg00117.shtml, and https://lists.cs.wisc.edu/archive/condor-users/2011-November/msg00184.shtml) without any luck.
Hang tight there. I think you'll be able to revisit the hierarchical quotas in 7.6.5. Alternatively, you could try the 7.7.x line; the fix might be in there already. I've run production systems on the Condor development line in the past and it's generally pretty good.

Good luck!

Regards,
- Ian 

---
Ian Chesal

Cycle Computing, LLC
Leader in Open Compute Solutions for Clouds, Servers, and Desktops
Enterprise Condor Support and Management Tools

http://www.cyclecomputing.com
http://www.cyclecloud.com
http://twitter.com/cyclecomputing