Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] Queueing ssh sessions?

Date: Mon, 11 May 2009 13:25:28 -0400
From: Ian Chesal <ICHESAL@xxxxxxxxxx>
Subject: Re: [Condor-users] Queueing ssh sessions?

> :Technically you could use condor_qedit to edit the concurrency_limit
> :settings on a job. How well that would work: I can't say. Definitely you
> :can use condor_config_val to manipulate the max limits stored on the
> :negotiator in your system.
>
>  condor_config_val is likely what I'm looking for, if it comes to
>  that. Thanks.
>
> :But why do you need to manipulate these? Why
> :wouldn't you just submit your interactive job with the concurrency_limit
> :counters it wants to consume and then let Condor decrement and increment
> :them when the job is executed?
>
> Ideally this is what I would want, yes.  The command line option is a
> bit of a kludge to fall back on if I need to.
>
> I have two "needs" for this.
>
> First, people are asking for an interactive command-line session to any
> available node.
>
> Second, I'm trying to get Matlab Parallel Computing Toolkit to work.
> If you're familiar with it, we've been using the "distributed" half for
> some time, but the "parallel" (mpi'ish) half isn't supported with
> Condor (yet?) and I'm trying to make that go for one of my users.  They
> have an ssh-based example, which I have working but not through the
> queue.
>
> I'd like to have it work through the queue (either porting it to the
> parallel universe or using the ssh example), but even more than that
> I'd like to be able to have the licenses in the Condor concurrency
> limit match how many are available.  If everything goes through the
> queue like it does with the distributed jobs now, it works
> beautifully.  If people start running it outside the queue and the
> concurrency limit isn't adjusted, jobs will fail in bad ways when it
> starts instances that it doesn't have licenses for (I think they
> should handle this better within Matlab, but Mathworks thinks
> everything should be pushed off on the scheduler).

Ahh. Okay. I see what you're doing now. You've told Condor you have X
licenses, but you want to change it to X-1 if someone consumes a license
by running "tool.exe" at their prompt. Is that it?
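
As a rough sketch of that adjustment, in Python: this assumes the limit
is called MATLAB_LIMIT, that the pool permits runtime configuration
changes on the negotiator (ENABLE_RUNTIME_CONFIG plus a matching
SETTABLE_ATTRS entry), and that the count of licenses in use outside the
queue comes from somewhere else, such as the license server's status
output. The function names are hypothetical, and the commands are only
built here, not run.

```python
def adjusted_limit(total_licenses, used_outside_queue):
    """Licenses Condor may still hand out; never drops below zero."""
    return max(total_licenses - used_outside_queue, 0)


def negotiator_commands(limit_name, new_limit):
    """Build (but do not run) commands that would push a new limit.

    Assumption: runtime configuration is enabled on the negotiator,
    so a -rset followed by a reconfig takes effect.
    """
    setting = f"{limit_name.upper()}_LIMIT = {new_limit}"
    return [
        ["condor_config_val", "-negotiator", "-rset", setting],
        ["condor_reconfig", "-daemon", "negotiator"],
    ]


if __name__ == "__main__":
    # Example: 10 licenses in total, 1 consumed by someone at their prompt.
    for cmd in negotiator_commands("matlab", adjusted_limit(10, 1)):
        print(" ".join(cmd))
```

Something would have to run this on a schedule and keep the outside-use
count accurate, which is a large part of why this gets painful to
support.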

I can tell you now: this is a nightmare support situation, especially
if you've got a lot of different tools, but particularly with MatLab,
because the actual license consumption model used by MatLab is not
readily modelled with Condor's simple counters. MatLab is a nightmare in
this regard.

I've been around and around this problem here at Altera, and there really
is no optimal solution that makes the best use of licenses, compute
hardware, and users' time all at once. Mixing direct tool access with
through-the-compute-farm access is a really tough problem. You have to
pick two of those three and optimize for them. We always optimize for
license use and user time -- I'm willing to burn compute farm nodes
because they're usually cheaper than the tools and engineers.

For example: all my users queue for access to tools. Some jobs run the
tools in batch mode, some in interactive mode. In interactive mode the
user may not actually be executing the tool binaries 100% of the time, so
FlexLM may not report the tool as "in use", but I still don't try to use
those in-between cycles for batch jobs. It's too expensive to have the
engineer try to run the tool they *thought* they had claimed, only to
find out the system gave away the license while they were getting a
coffee.

Where I can, I have my batch jobs use the FlexLM-side license queueing
and I over-allocate my "licenses" in my system for batch jobs. This gets
me closer to 100% tool utilization for batch jobs. We don't put rigid
controls on the batch jobs that use tools, so over-allocating helps deal
with the case where a user submits a job that claims a licensed tool
resource but then runs a job script that does a lot of other
computational work before/after running the actual licensed tool binary.
In this case I'm willing to have the batch job take longer and hold the
machine waiting for a license from FlexLM, because the CPU time is
cheap in comparison to the tool license.
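
To illustrate the over-allocation idea, here is a small Python sketch.
The 1.25 factor, the function names and the run_tool.sh executable are
made up for the example; the only Condor-specific piece is the
concurrency_limits submit command, which names the counter a job
consumes.

```python
import math


def batch_limit(real_licenses, overallocation=1.25):
    """Counter value to advertise to Condor for batch jobs.

    Over-allocating lets a few extra jobs start; the surplus jobs wait
    in the FlexLM-side queue until a real license frees up.
    """
    return math.ceil(real_licenses * overallocation)


def submit_description(executable, limit_name):
    """Return a minimal submit description that claims one unit of a limit."""
    return "\n".join([
        "universe = vanilla",
        f"executable = {executable}",
        f"concurrency_limits = {limit_name}",
        "queue",
    ])


if __name__ == "__main__":
    print("advertised limit:", batch_limit(10))
    print(submit_description("run_tool.sh", "matlab"))
```

The right factor depends on how much non-tool work the job scripts
typically do, so it is something to tune rather than compute.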

Balancing all this stuff is hard. It's an ongoing struggle. Keeping it as
simple as possible is my best advice. :)

> :As for running interactive jobs: I do this in my pool. Users can run
> :VNC, NX, XTerm and XDMCP jobs on machines.
>
> I'm curious how you have this configured.  How do users know what host
> to connect to with VNC?  How is the local display made available?
> xhost + seems the obvious but scary choice, or possibly using ssh
> port forwarding set up on the execute node...

Users start these sessions through a web interface that eventually
submits a Condor job for them. If they choose VNC, the web interface can
detect whether they have a VNC viewer started on their machine in listen
mode. If they don't, it gives them instructions for starting a listener.
The job is run with the address and port of the listener, and the VNC
server just opens the session for them on their desktop when the job runs.
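
The listener check can be as simple as a TCP connection attempt. The
sketch below is a guess at the general shape, not the actual web
interface code; the function name is hypothetical, and port 5500 is just
the conventional default for a listening VNC viewer.

```python
import socket


def listener_is_up(host, port=5500, timeout=2.0):
    """Return True if something accepts TCP connections at host:port.

    This only proves a port is open; it does not verify that the
    process behind it is actually a VNC viewer in listen mode.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # "localhost" stands in for the user's desktop address.
    if listener_is_up("localhost"):
        print("listener found; submit the job with this host and port")
    else:
        print("no listener; show instructions for starting one")
```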

NX is a little more complicated. The job runs and then tells the user
(through our web interface, but you could just as easily send them an
email) how to connect to the NX session with the NX client on their
machine. NX (at least the free version) doesn't have a listen mode for
the client. But still: it's such a fast and elegant protocol that it's
worth the hassle. We're phasing out VNC support in favour of NX. NX
sessions, especially over oceans, are very fast. I really can't say
enough nice things about NX -- works over ssh, secure, you can
disconnect and reconnect to sessions like you can with VNC. It's slick.

Users spawning X-based sessions are expected to have run "xhost +" on
their machine to allow open access. This isn't a big problem when you're
in a corporate environment like I am. I could see why you wouldn't want
to do this in an open school lab. X is the worst of the protocols -- slow
and kludgy.

Hope that helps.

- Ian

