
[HTCondor-users] Deployment architecture advice



Hi all,

This is a somewhat lengthy mail, so if you're short on time, you may want to skip it. Warning given,
let's get to the problem!


We're in the process of migrating our resource management system to HTCondor, but I'd like some advice from
more experienced users, especially concerning the architecture we've decided upon. See, we're not only
moving to Condor, but also switching from a shared-filesystem style of processing to a distributed one, using
the file transfer mechanisms already in place in HTCondor. The results have been impressive so far, since
our network/storage infrastructure has long been a bottleneck. So, on to the details.

1. Our software used to prepare the datasets to be processed directly on a shared filesystem (NAS), which
used to take a long time. So we've switched to using the workstations' local disks when generating the
datasets, which greatly reduced the time spent. OTOH, we then had to move this data to a place where the
execute nodes could see it, and we decided to use their local disks as well - since the shared NAS could
become a bottleneck again. The way we were able to implement this architecture in Condor was, then (see the
configuration sketch after this list):

- a SCHEDD daemon per submitting node - since we needed to transfer files, this was the only way we could
 find to let the SHADOW reach local files (since it's spawned by the SCHEDD). We tried a centralized
 SCHEDD, but the spawned SHADOW couldn't reach the files (which seems pretty obvious to me, but we tried anyway).

- Centralized COLLECTOR and NEGOTIATOR (no news here).

- Multiple distributed STARTDs (execute nodes - nothing new there either).
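
For reference, here's a minimal sketch of that daemon layout as we've configured it (host names are
placeholders, and this is just our understanding of the knobs, so corrections are welcome):

    # condor_config on each submitting workstation
    CONDOR_HOST = central.example.com
    # run a local SCHEDD so the spawned SHADOW can read the local disk
    DAEMON_LIST = MASTER, SCHEDD

    # condor_config on the central server
    DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR

    # condor_config on each execute node
    DAEMON_LIST = MASTER, STARTD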


So, if we deploy this for real - we're currently testing with very few users - I'd have around 50 SCHEDDs -
which means 50 different queues - to manage. Even though that doesn't seem like much of a hassle, it's quite
different from the mostly centralized architecture we had previously. Also, since we're new to HTCondor, we
have no idea whether this is a usual/good approach. We've been studying for the last three or four months and
tests are going really well, but I'd like some kind of validation before adopting this for real.

2. A problem we've found with this scheme so far happened during one of our users' "real tests", where he was
assessing the performance of the two solutions (the previous one and the new one). One aspect where the current
solution fails us is that it's highly sensitive to network fluctuations. We know that the "final" solution is
fixing the network itself, but we have several different buildings in our company, instabilities with our
electrical supplier, and we can't put all the infrastructure under power generators and/or UPS, as it'd be too
costly. We would be glad if any solution could endure a slight "blink" on the network, say 5 minutes or so. With
this in mind, our user proceeded to the PNCT - pull the network cable test - while submitting from his
workstation and, surprisingly, everything appeared to be fine after he plugged the cable back in. Already-running
jobs kept running and new ones were queued nicely, even while the cable was out, since the SCHEDD runs locally.
Unfortunately, he then noticed that new jobs stayed queued forever, and running ones "died" after some time, even
though they were actually still running - we have a web system where users can check the state of jobs by
consulting the centralized server directly.
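
What we've been looking at to ride out short blinks is the job lease: as far as we understand the manual, a job
lease lets the execute side keep a disconnected job alive while the SCHEDD retries the connection. A minimal
sketch of the submit file line (the 20-minute value is just our guess at a comfortable margin):

    # keep running jobs alive through up to 20 minutes of disconnection,
    # so a short network "blink" doesn't kill them
    job_lease_duration = 1200

Our suspicion is that in the test above the lease itself wasn't the problem, and the reconnect failed because
of the address change described next.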

As it turned out, the problem was that he had connected the cable to a DIFFERENT (by a fortunate mistake) port
than the previous one, so outgoing connections were going through a different IP address than the one advertised
previously by Condor and recorded in the COLLECTOR. I tried invalidating the ClassAd related to this SCHEDD and
managed to do it, only to see it appear again with the same old invalid IP address. Long story short: I couldn't
reach this machine through any of the native Condor tools, since they use the address stored in the COLLECTOR
instead of the new one. We were using "condor_q -global" in a script to gather information from the pool, and
this broke too, since the tool appears not to be fault tolerant - whenever it tried to reach that SCHEDD and
failed, it would print an error message and quit. That's why I said the mistake was fortunate, since it showed
me I shouldn't rely on the "-global" switch to reach the SCHEDDs and should instead go to each one individually.
This problem was one of the main reasons I decided to come here first, before attempting a more general
deployment.
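
Two things we're considering, in case they're useful (both are untested assumptions on our part): pinning the
advertised address on each submit machine,

    # condor_config on the submitting workstation: advertise only the
    # address of the NIC the pool should use (placeholder address)
    NETWORK_INTERFACE = 192.168.10.42

and replacing "condor_q -global" in our script with a loop that asks the COLLECTOR for the known SCHEDDs and
merely warns about the ones it can't reach:

    # run on the central server; keeps going when a SCHEDD is unreachable
    for S in $(condor_status -schedd -af Name); do
        condor_q -name "$S" || echo "warning: could not reach $S" >&2
    done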

FTR, our submitting nodes run Windows, while the central server is Linux, as are the execute nodes - for now,
since we're looking forward to squeezing some juice out of our Windows machines too. We're running HTCondor 8.6.1
everywhere except the credential server, since we found an error with the case-sensitivity of Windows user logins
(ticket already opened: https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=6200); so we've stuck with 8.4.11
for the CREDD for now.



All that being said, I'd really appreciate any advice on how to achieve what we're attempting - that is,
using file transfer with HTCondor, with multiple submitting nodes (each user submitting from their own
machine). It seems a pretty simple task, but I've learned to fear those....    :-)
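
For concreteness, this is the shape of the submit files we're using on the workstations (executable and file
names are made up):

    # submit file on a Windows workstation; inputs live on the local disk
    universe                = vanilla
    executable              = process_dataset.exe
    arguments               = chunk_042.dat
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    transfer_input_files    = C:\datasets\chunk_042.dat
    output                  = chunk_042.out
    error                   = chunk_042.err
    log                     = chunk_042.log
    queue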




Thanks in advance!