
Re: [Condor-users] Condor with multiple sites over a WAN link.

Daniel Pittman wrote:
G'day.

I hope these questions are not too frequently asked; I have looked through the
manual and other documentation without finding an answer.

To give some background, we are starting to deploy Condor to help run a
variety of batch processed reporting and data manipulation jobs, most of which
are relatively small: no more than a few CPU-hours of run-time.


We have two major sites connected via a WAN link, and we would like users to
be able to run jobs that span these two sites, mostly for historical reasons
such as "data extraction can only happen on machine A, processing only on
machine B", where A and B are at opposite ends of the link.

The WAN link is always available and quite reliable, although somewhat
bandwidth-constrained.


So, my understanding is that this deployment is probably best served by a
single Condor central manager, with our submit and execute machines all
talking to it over the local network or the WAN link respectively.

You could also use flocking and have a central manager at each site. This would eliminate the flow of execute node ads across the WAN, though I doubt that's really a big concern.

The main disadvantage of flocking over a single pool is that there is no way to express cross-pool preferences. Matchmaking takes place independently in the two pools and whichever pool happens to come up with a match first will take the job. However, there is a built-in preference to run in the local (default) pool of the schedd, because it will only try flocking if it has some jobs that have been rejected by the local central manager. This is a somewhat weakly enforced preference, because new jobs that show up once the schedd has decided to try flocking will be eligible for flocking, even though they may not have been considered yet in the local pool.
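
If you do go the flocking route, the wiring is roughly this; the hostnames
are made up, and you will want to fold the FLOCK_FROM list into whatever
ALLOW_WRITE policy you already have:

  # On the site A schedd: try site B's pool once some jobs have
  # been rejected by the local negotiator.
  FLOCK_TO = cm.site-b.example.com

  # On site B's central manager and execute nodes: let the site A
  # schedd write to this pool.
  FLOCK_FROM = schedd.site-a.example.com
  ALLOW_WRITE = $(ALLOW_WRITE), $(FLOCK_FROM)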

Assuming that is the case, what sort of traffic is this likely to generate?
I believe that this is just the class-ad transmission via UDP every five
minutes or so, plus the data transfer for individual jobs.

The class-ad transmission and job data transfer are the main sources of traffic. If you are using strong security to authenticate Condor communications, be aware that this is sensitive to network latency, because of round-trips in the protocols. For this reason, we use SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION in our cross-Atlantic Condor (glidein) pools. That solves authentication latency problems in the schedd. The other place where it hurts is in the collector. We have also used a two-tier collector in order to scale to over 20,000 slots. See the How-to on multi-tier collectors for details.
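
As a rough sketch, the knobs involved look like this (the values are
illustrative, not recommendations):

  # Startd ads go to the collector every UPDATE_INTERVAL seconds
  # (default 300); raising it trades WAN chatter for staler
  # machine information.
  UPDATE_INTERVAL = 600

  # Reuse a secret established at match time so the schedd and
  # startd can skip the usual authentication round-trips over the
  # WAN; set on both the submit and execute sides.
  SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION = True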


Within this model we want jobs to prefer to run at their local site, and only
to run at the site across the WAN link when that is mandatory.

I believe the NEGOTIATOR_{PRE,POST}_JOB_RANK configuration, together with a
ClassAd attribute identifying location, is the right tool to use here, as
documented in the admin tips and tricks: http://nmi.cs.wisc.edu/node/1479

Specifically, we would use the PRE version to rank the local site higher,
which will send the job to a machine there, and use the RANK in the job
itself to work out which machine to use.

Is that correct, or do I need to do something more complex with the START
expression to enforce this rule?

Your plan sounds good to me. We use NEGOTIATOR_PRE_JOB_RANK for site preference in our cross-campus condor pool. It works.
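
For the record, a minimal version of that setup looks something like the
following. The Site/SubmitSite attribute names are only examples, and
SUBMIT_EXPRS is the knob that stamps extra attributes into every job ad:

  # On every execute node: advertise which site it lives at.
  Site = "site-a"
  STARTD_ATTRS = $(STARTD_ATTRS), Site

  # On every submit node: stamp jobs with the submitter's site.
  SubmitSite = "site-a"
  SUBMIT_EXPRS = $(SUBMIT_EXPRS), SubmitSite

  # On the central manager: machines at the job's own site always
  # sort first; the job's own RANK then breaks ties among them.
  NEGOTIATOR_PRE_JOB_RANK = ifThenElse(MY.Site =?= TARGET.SubmitSite, 1000, 0)

Jobs can then express which machine they prefer within a site with the usual
rank = ... line in the submit file.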

--Dan
Given that some of these jobs can generate 100 to 300 MB of (post-compression)
output data for a 30-CPU-minute runtime, the cost of waiting for a free slot
at the local site will often beat out the cost of data transfer across the WAN
link.

It *IS* mandatory that some jobs run on machines on the other side of the
link, though, so I can't just set START to only accept the same site as the
submitting machine.


Thanks in advance for your time,
       Daniel
- ✣ Daniel Pittman ✉ daniel@xxxxxxxxxxxx ☎ +61 401 155 707
               ♽ made with 100 percent post-consumer electrons
   Looking for work?  Love Perl?  In Melbourne, Australia?  We are hiring.
