
Re: [HTCondor-users] Deployment architecture advice



Hi Ivo

We're a predominantly Windows HTCondor site.

Windows execute nodes and Windows submit nodes, but with Linux central managers (collector/negotiator),
and "some" Linux execute nodes.

We are spread across all the states in Australia, so we have one pool per state, with flocking enabled.
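
For anyone curious, the flocking side of it is only a couple of knobs. Roughly (hostnames invented,
and the ALLOW/security settings of course have to match your own site):

  # on the submit nodes in, say, the NSW pool: overflow to the other states' pools
  FLOCK_TO = condor-cm.vic.example.com, condor-cm.qld.example.com

  # on each remote central manager: accept flocked jobs from the NSW submit nodes
  FLOCK_FROM = *.nsw.example.com
  ALLOW_WRITE = $(ALLOW_WRITE), $(FLOCK_FROM)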

Approx 10,000 slots, mainly on desktops. We used to have each Condor user's desktop configured
as a submit node. This can get awkward to look after. It also limits the number of jobs that a single
user can queue/have running at once, due to the limited resources of a desktop.

We now have dedicated Windows submit nodes. These are all VMs running on our virtual infrastructure,
all running Windows Server 2008 with 8 cores, 32 GB RAM, and varying amounts of disk space.
Each of these submit nodes can easily handle 5,000-6,000 running jobs with 100,000 in the queue.
Our entire organisation uses AD, so users can just log on as they would at their desktops.

We don't use the Condor file transfer mechanism. All user jobs are DOS batch files. These
batch jobs are responsible for connecting to fileserver shares, downloading executables and
input data, running the executable, uploading output files to the fileserver(s), cleaning up (deleting)
files, and exiting. They are therefore "self-contained" and can easily be tested outside of Condor, or even
used with other systems, e.g. Windows HPC clusters.
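
To give a feel for it, one of these jobs, heavily trimmed and with invented paths, looks something like
the batch file below; the submit file then just runs the batch file and transfers nothing itself:

  @echo off
  rem Self-contained job: fetch inputs from the fileserver, run, push outputs back, clean up.
  rem %1 is the case number passed in via the submit file's arguments line.
  net use Q: \\fileserver\projects /persistent:no
  mkdir %TEMP%\job_%1
  xcopy /y Q:\myproject\bin\model.exe %TEMP%\job_%1\
  xcopy /y Q:\myproject\input\case_%1\*.* %TEMP%\job_%1\
  cd /d %TEMP%\job_%1
  model.exe case_%1.dat > case_%1.log
  xcopy /y case_%1.out Q:\myproject\output\
  xcopy /y case_%1.log Q:\myproject\output\
  cd /d %TEMP%
  rmdir /s /q %TEMP%\job_%1
  net use Q: /delete
  exit /b 0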

We're currently looking at and testing the deployment of Linux VMs within the virtual infrastructure
of our data centres (VMware, vRealize, vROps, ESX servers, etc.). These would be started/stopped
based on ESX server load.

Over the first 3 months of this year the system ran 6,834,078 jobs, for a total of 635.01 CPU-years
of single-core compute time.

I guess my 2 main points of advice would be:

1. dedicated submit nodes
2. "self-contained" jobs, i.e. batch files

Cheers

Greg

-----Original Message-----
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Ivo Cavalcante
Sent: Friday, 14 April 2017 8:10 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Deployment architecture advice

Hi all,

This is a somewhat lengthy mail, so if you're short on time it may be better to skip it. Warning given,
let's get to the problem!


We're in the process of changing our resource management system to HTCondor, but I'd like some advice from
more experienced users, especially concerning the architecture we've decided upon. See, we're not only
moving to Condor, but also switching from a shared-filesystem style of processing to a distributed one, using
the file transfer mechanisms already in place in HTCondor. The results have been impressive so far, since
our network/storage infrastructure has long been a bottleneck. So, on to the details.

1. Our software used to prepare the datasets to be processed directly on a shared filesystem (NAS), which
used to take a long time. So we've changed to using the workstations' local disks while generating the
datasets, which gave a great improvement in the time spent. OTOH, we had to move this data to a place where
the execution nodes could see it, and we decided to use their local disks as well - since the shared NAS could
become a bottleneck again. The way we were able to implement this architecture in Condor was, then
(a minimal submit-file sketch follows the list):

- a SCHEDD daemon per submitting node - since we needed to transfer files, this was the only way we could
 find to allow the SHADOW to reach local files (since it's spawned by the SCHEDD). We tried using a centralized
 SCHEDD, but the spawned SHADOW couldn't reach the files (which seems pretty obvious to me, but we tried anyway).

- Centralized COLLECTOR and NEGOTIATOR (no news here).

- Multiple distributed STARTDs (execute nodes - no news here either).
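
For completeness, the submit-file side of the file transfer mentioned above is just the standard set of
commands; a minimal sketch (file and executable names invented):

  executable              = process_case
  arguments               = case_001
  transfer_input_files    = case_001.dat
  should_transfer_files   = YES
  when_to_transfer_output = ON_EXIT
  log                     = case_001.log
  output                  = case_001.out
  error                   = case_001.err
  queue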


So, if we deploy this for real - we're currently testing with very few users - I'd have around 50 SCHEDDs,
which means 50 different queues, to manage. Even though it doesn't seem like much of a hassle, it's quite
different from the mainly centralized architecture we had previously. Also, since we're new to HTCondor, we
have no idea whether this is a usual/good approach. We've been studying for the last three or four months and
the tests are going really well, but I'd like some kind of validation before adopting this for real.

2. A problem we found with this scheme so far happened during one of our users' "real tests", where he was
assessing the performance of the two solutions (the previous one and the new one). One of the aspects where the
current solution fails us is that it's highly sensitive to network fluctuations. We know that the "final"
solution is fixing the network itself, but we have several different buildings in our company, instabilities
with our electrical supplier, and we can't put all the infrastructure on power generators and/or UPSes, as it'd
be too costly. We would be glad if the solution could endure a slight "blink" in the network, like 5 minutes or
so. With this in mind, our user proceeded to the PNCT - pull the network cable test - while submitting from his
workstation and, surprisingly, everything appeared to be fine after he connected the cable again. Already-running
jobs kept running and new ones were queued nicely, even while the cable was out, since the SCHEDD runs locally.
Unfortunately, he then noticed that the new jobs stayed queued forever, and the running ones "died" after some
time, even though they were actually still running - we have a web system where users can check the state of
their jobs, consulting the centralized server directly.
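
For reference, the knob that seems most relevant to riding out a short network "blink" is the job lease.
A minimal submit-file sketch, with the exact value and behaviour to be checked against the manual for the
version in use:

  # keep the job alive on the execute node and let the SCHEDD reconnect,
  # as long as the disconnection lasts less than this many seconds
  job_lease_duration = 2400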

As it turned out, the problem was that he connected the cable to a DIFFERENT port (by a fortunate mistake) than
the previous one, so outgoing connections were going through a different IP address than the one previously
advertised by Condor and recorded in the COLLECTOR. I tried invalidating the ClassAd related to this SCHEDD and
managed to do it, only to see it appear again with the same old invalid IP address. Long story short: I couldn't
reach this machine through any of the native Condor tools, since they use the address stored in the COLLECTOR
instead of the new one. We were using "condor_q -global" in a script to gather information from the pool, and
this broke too, since the tool appears not to be fault tolerant - whenever it tried to reach that SCHEDD and
failed, it would print an error message and give up. That's why I said the mistake was fortunate, since it
showed me I shouldn't rely on the "-global" switch to reach the SCHEDDs and should instead go to each one
individually. This problem was one of the main reasons I decided to come here first, before attempting a more
general deployment.
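
Roughly what "going to each one individually" looks like in our case (illustrative only; run from a Linux box
that can see the central manager):

  #!/bin/sh
  # query every SCHEDD advertised in the collector, one at a time,
  # so that a single unreachable SCHEDD no longer kills the whole report
  for schedd in $(condor_status -schedd -autoformat Name); do
      echo "=== $schedd ==="
      condor_q -name "$schedd" || echo "WARNING: could not contact $schedd"
  done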

FTR, our submitting nodes are running Windows, and the central server is Linux, as are the executing nodes -
for now, since we're looking forward to squeezing some juice out of our Windows machines too. We're running
HTCondor 8.6.1 except on the credential server, since we found an error with the case-sensitivity of Windows
user logins (ticket already opened: https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=6200); so we've
stuck with 8.4.11 for the CREDD for now.



All that being said, I'd really appreciate any advice on how to achieve what we're attempting - that
is, using file transfer with HTCondor, with multiple submitting nodes (each user submits from their own
machine). It seems like a pretty simple task, but I've learned to fear those... :-)




Thanks in advance!
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/