
Re: [Condor-users] Delay on submit, and other newbie issues



> I've shortened it to 3 seconds, and it still seems like it's 10-15
> seconds until a job starts. I'll clock it more closely and
> see what I find.

That's really about as good as it gets for a freshly negotiated job. If
you're submitting similar jobs in clusters and machines don't need to be
re-negotiated, you can shave a few seconds off.

For our really fast jobs we use an additional layer of clustering: we
queue up the fast jobs (anything that runs from a few seconds to a few
minutes) in their own queue and we submit executor jobs to our Condor
pool. These executor jobs pull work from the fast queue. Since the pull
is done with very little regard to constraints, its overhead is small (a
second or two at worst) compared to Condor's. The hybrid approach really
does give you the best of both worlds.
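
To give you a feel for it, here's a rough sketch of what one of those
executor jobs can look like. It's not our actual code: fetch_task() and
report_result() are placeholders for whatever fast-queue mechanism you
have (a database table, a directory of files, whatever), and the idle
timeout is just an example:

#!/usr/bin/env python
# Pull-mode executor, submitted to Condor as an ordinary job.
# fetch_task()/report_result() are placeholders for your fast-queue API.
import subprocess
import time

IDLE_LIMIT = 300   # seconds of idleness before we give the slot back

def fetch_task():
    # Placeholder: return (command, args) from your fast queue, or None.
    return None

def report_result(task, returncode):
    # Placeholder: record the result back in your fast queue.
    pass

idle_since = time.time()
while True:
    task = fetch_task()
    if task is None:
        if time.time() - idle_since > IDLE_LIMIT:
            break      # exit so Condor can hand the machine to other jobs
        time.sleep(1)
        continue
    idle_since = time.time()
    command, args = task
    rc = subprocess.call([command] + list(args))
    report_result(task, rc)

The important bit is that the executor exits after a stretch of idleness
so the slot goes back into the pool for regular Condor jobs.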

> also, I see jobs submitted from other hosts are not in the
> same "cluster"? I thought all my queue entries are submitted
> to the same pool!

They're all negotiating for the same machines but each schedd is its own
queue and each queue has its own cluster counter.
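
You can look at another schedd's queue without logging in to that
machine, though:

condor_q -name <name of the schedd machine>

or condor_q -global to see every queue the collector knows about.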

> I can't seem to use condor_rm to manage the queue if I'm not
> logged on to the same machine (unless I missed something
> again). It's very confusing that job monitoring and
> management is not symmetric like that.
> I'm not sure why this is, but more importantly, how do I fix this?

condor_rm -name <name of the schedd machine> <cluster>

will let you remove jobs on a remote schedd. But you can usually only
remove *your own* jobs unless you've set your security up to be pretty
open.

> > I'll let someone else field the FlexLM question in more detail.
>
> Thanks for all your help so far!

I don't know where to start with this one other than to say: Condor
doesn't know about FlexLM and co-ordinating license use based on some
sort of FlexLM integration across multiple schedds can be...not fun.

If you were to simplify a bit and use a single schedd for your site, you
could submit jobs that use a particular license to the system on hold,
and then use a cron process that monitors the queue for held jobs that
require a particular license and releases them according to your current
FlexLM counters for that license. It's an imperfect solution though, and
it gets absolutely hairy if you have a job that uses more than one
limited license.
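
To make that a little more concrete: the jobs would go in with
hold = true and something like +LicenseNeeded = "synplify" in their
submit files (the attribute name is made up, use whatever you like), and
the cron side would be roughly this sketch, with licenses_free()
standing in for your own lmstat parsing:

#!/usr/bin/env python
# Cron-driven release of held jobs, gated on FlexLM availability.
# LicenseNeeded is an illustrative custom job attribute; licenses_free()
# is a placeholder for however you count free seats with lmstat.
import subprocess

LICENSE = "synplify"

def licenses_free(license_name):
    # Placeholder: parse lmstat output and return the number of free seats.
    return 0

# Held jobs have JobStatus == 5; print them as cluster.proc ids.
constraint = 'JobStatus == 5 && LicenseNeeded == "%s"' % LICENSE
out = subprocess.check_output(
    ["condor_q", "-constraint", constraint,
     "-format", "%d.", "ClusterId", "-format", "%d\n", "ProcId"]).decode()

for job_id in out.split()[:licenses_free(LICENSE)]:
    subprocess.call(["condor_release", job_id])

It's still racy (someone outside Condor can grab a seat between the
lmstat check and the job actually starting), which is part of what makes
it imperfect.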

The really simple solution is to assign particular machines to run jobs
that need a certain license. It's a bit like pseudo-node-locking the
licenses. I have 5 machines in my pool that prefer jobs that use
Synplify, and any job in the queue that needs Synplify will only run on
those 5 machines. So I know there will never be more Synplify jobs
running than I have Synplify licenses, and I can ensure that when a
Synplify job wants to run it runs right away, so my Synplify licenses
are always in use. For jobs that mix limited licenses we might have a
special flag the jobs use and a few special machines, but we really try
to discourage people from writing flows that use multiple limited
licenses in one job. Instead they should write DAGs where each node uses
only one license-limited tool. So you "do A with tool A, then do B with
tool B, etc.".
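
In Condor terms the pieces look roughly like this (HAS_SYNPLIFY and
NeedsSynplify are names I'm making up for illustration; use whatever you
like). On the 5 machines' local config:

HAS_SYNPLIFY = True
STARTD_ATTRS = $(STARTD_ATTRS), HAS_SYNPLIFY
# (STARTD_EXPRS on older installs)
RANK = (TARGET.NeedsSynplify =?= True)

and in the Synplify jobs' submit files:

requirements   = (HAS_SYNPLIFY =?= True)
+NeedsSynplify = True

The requirements line pins the jobs to those machines and the startd
RANK makes those machines prefer them. For the "tool A then tool B"
flows the DAG input file is just:

JOB stepA stepA.sub
JOB stepB stepB.sub
PARENT stepA CHILD stepB

so each node's submit file only ever asks for one limited license.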

That simple solution has worked well for us here at Altera for a few
years now. As long as your license counts aren't changing frequently,
it's a one-time pain to set up and then it runs fairly smoothly.

- Ian
