On Thu, Oct 7, 2010 at 3:14 AM, Matt Hope <Matt.Hope@xxxxxxxxxxxxxxx>
I am currently thinking over how to work around the limitations
of the Windows schedd/shadow structure.
Using Windows Server 2003 64-bit and tweaking the registry for a
few things, we can only stably run 200 jobs per submit node, which is a real pain.
Thankfully running it in a VM appears to be acceptable, so we’ve pretty
much headed towards per USER submit VMs.
On bare metal, with some exceptionally nice hardware, I've gotten it up to 150 running jobs/schedd with 4 schedds on the box. And even that feels meta-stable at times. The lightest of breezes pushes it off balance. This is Win2k3 Server 64-bit. AFAIK there are no registry tweaks to the image. What tweaks are you making?
o In the
hive mind's opinion, should I not even consider testing job hooks (as a replacement
for the schedd/negotiator) on Windows right now?
Well, Todd closed that ticket. I swear it's never worked in 7.4.x for me, but I have to retest now and confirm this. It can't hurt to try it. But you lose so much with hooks for pulling jobs. You have to do your own job selection, or, if you let the startd reject jobs, you have to have a way to pull and then put back jobs that have been rejected. That is inefficient, and difficult to architect so that it works well when you've got a *lot* of slots trying to find their next job to run. I'll admit this is exactly what I had working at Altera, but it was a good year plus of work to get it functioning.
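A rough Python sketch of the pull-and-put-back pattern described above. It is not Condor code: JobPool, fetch, idle_count and the slot_accepts callback are all hypothetical names, and the single lock is an assumption made to keep the example short.

```python
import threading
from collections import deque


class JobPool:
    """Hypothetical central pool of idle jobs that slots pull from."""

    def __init__(self, jobs):
        self._lock = threading.Lock()
        self._idle = deque(jobs)

    def fetch(self, slot_accepts):
        """Return the first idle job the slot accepts, or None.

        Jobs the slot rejects are put back at the front of the queue
        in their original order, so they are not starved.
        """
        with self._lock:
            rejected = []
            chosen = None
            while self._idle:
                job = self._idle.popleft()
                if slot_accepts(job):
                    chosen = job
                    break
                rejected.append(job)
            # extendleft reverses its input, so reverse first to keep order.
            self._idle.extendleft(reversed(rejected))
            return chosen

    def idle_count(self):
        with self._lock:
            return len(self._idle)


# A slot with 1024 MB of memory skips the big job and takes the small one.
pool = JobPool([{"id": 1, "memory": 4096}, {"id": 2, "memory": 512}])
small_slot = lambda job: job["memory"] <= 1024
picked = pool.fetch(small_slot)
```

Every fetch holds one lock and may walk the whole queue, which hints at why this gets hard to scale when many slots ask for work at the same time.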
Multiple per user daemons per box
doubt this would actually improve things
It does, assuming you've got enough CPU so the shadows don't starve on startup. That's one area where I notice Windows falling down quite frequently: if you've got a high rate of shadows spawning, it seems to have a lot of trouble getting the shadow<->startd comm set up and the job started. Lately I've been running with a 2:1 ratio of processors/cores to scheduler daemons.
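As a back-of-the-envelope illustration of that 2:1 ratio, a small Python sketch. The function names are hypothetical, and the 150 jobs/schedd figure is just the bare-metal number quoted earlier in this thread, not a general limit.

```python
def schedds_for_cores(cores, cores_per_schedd=2):
    """Number of schedds to run for a given core count at a cores:schedd ratio."""
    return max(1, cores // cores_per_schedd)


def running_job_capacity(schedds, jobs_per_schedd):
    """Rough upper bound on running jobs for one submit box."""
    return schedds * jobs_per_schedd


# 4 schedds at 150 running jobs each, as on the bare-metal box above.
capacity = running_job_capacity(4, 150)
```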
o Also not
clear if anyone uses this heavily on Windows
All the time.
Remote submit to Linux-based schedds
submission is ultimately a bit of a hack, and forces the client side to do a
lot more state checking
I have mixed feelings about remote submission. It gets particularly tricky when you're mixing OSes.
I've had better luck with password-free ssh to submit jobs to centralized Linux machines. And the SOAP interface to Condor is another approach I've had better experience with than remote submission from Windows to Linux. Not to say it can't work, just that I've found it tricky.
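A minimal Python sketch of the password-free ssh approach. It assumes key-based login is already set up, that condor_submit is on the remote PATH, and that the installed condor_submit reads the submit description from standard input when no file is given (worth checking against your version). The function names, user and host are hypothetical.

```python
import subprocess


def build_ssh_submit_command(user, host):
    """Build the ssh command line that runs condor_submit on the remote host."""
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    return ["ssh", "-o", "BatchMode=yes", f"{user}@{host}", "condor_submit"]


def submit_over_ssh(user, host, submit_description):
    """Pipe a submit description to condor_submit on the remote host."""
    result = subprocess.run(
        build_ssh_submit_command(user, host),
        input=submit_description,
        text=True,
        capture_output=True,
        check=True,  # raise CalledProcessError if ssh or condor_submit fails
    )
    return result.stdout


# Only the command is built here; nothing is executed.
command = build_ssh_submit_command("mhope", "submit01.example.com")
```

Calling submit_over_ssh with the text of a submit file would run the job submission on the Linux side, so the Windows client never talks to the schedd directly.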
Another option is to run a light-weight "submitter" daemon on the scheduler machine that a custom submit command talks to over REST or SOAP. The daemon runs as root, so it can su to any user and submit as them on that machine. Might be easier than ssh.
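The core of such a daemon might look like the Python sketch below. The REST/SOAP front end is left out, and build_su_submit_command, the allow-list and the username pattern are assumptions for illustration only. Since the process would run as root, the user name is validated before it goes anywhere near a command line.

```python
import re
import shlex

# Conservative pattern for local account names (an assumption, adjust to taste).
_USERNAME_RE = re.compile(r"^[a-z_][a-z0-9_-]*$")


def build_su_submit_command(user, submit_file, allowed_users):
    """Build the command a root-owned daemon would run to submit as `user`."""
    if user not in allowed_users or not _USERNAME_RE.match(user):
        raise PermissionError(f"refusing to submit as {user!r}")
    # shlex.quote protects the path, since su hands the string to a shell.
    return ["su", "-", user, "-c", "condor_submit " + shlex.quote(submit_file)]


command = build_su_submit_command("alice", "/tmp/my job.sub", {"alice"})
```

The daemon would pass the resulting list to something like subprocess.run after authenticating the request; how that authentication works is the hard part and is not covered here.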