
Re: [HTCondor-users] Elastically extend local condor pool by EC2 instances



Hi Jens,

I'm relatively new to HTCondor, and I have no clue about the Windows side of HTCondor, but I have been working on a similar problem for the past few months, so I can offer a few comments and suggestions that you might find useful.

I would say the advantage of running a separate cluster in AWS and flocking over to it is that it may better isolate the system from network issues. If you're running worker node instances that connect back to your external pool, you'll be assuming this connectivity risk, which may or may not be important to you. Note, however, that AWS charges for outbound data. So depending on your typical user job profile, you may also want to keep user job output in the cloud by default, and make transferring data out a more manual process so that users take only what they really need. Right now, though, I'm not entirely sure how we're going to deal with this issue ourselves.
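To get a feel for whether the outbound-data charge matters for a given job profile, a back-of-the-envelope estimate along these lines may help. This is only a sketch: the function name, the parameters, and especially the price per GB are assumptions for illustration, not quoted AWS rates, so plug in your own numbers.

```python
def estimate_egress_cost(jobs_per_month, output_gb_per_job,
                         fraction_transferred, price_per_gb):
    """Rough monthly cost of moving job output out of the cloud.

    fraction_transferred is the share of output users actually pull
    back (1.0 = everything, 0.1 = only what they really need).
    price_per_gb is an assumed rate; check current provider pricing.
    """
    if not 0.0 <= fraction_transferred <= 1.0:
        raise ValueError("fraction_transferred must be between 0 and 1")
    if min(jobs_per_month, output_gb_per_job, price_per_gb) < 0:
        raise ValueError("inputs must be non-negative")
    gb_out = jobs_per_month * output_gb_per_job * fraction_transferred
    return gb_out * price_per_gb


if __name__ == "__main__":
    ASSUMED_PRICE_PER_GB = 0.10  # placeholder value, not a real quote
    everything = estimate_egress_cost(1000, 2.0, 1.0, ASSUMED_PRICE_PER_GB)
    selective = estimate_egress_cost(1000, 2.0, 0.1, ASSUMED_PRICE_PER_GB)
    print(f"transfer all output:   {everything:.2f} per month")
    print(f"transfer 10% of output: {selective:.2f} per month")
```

Comparing the two cases gives a quick sense of how much keeping output in the cloud by default would save.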

I don't know if you can connect Linux instances back to your Windows pool, but we're testing a tool developed by the HTCondor team called condor_annex [1], which allows you to manually order up EC2 instances that then connect back to your external pool. If you're interested in this, I can pass along the secret sauce you would need to bake into your instance images to get this working. I need to write something up on this soon anyway. We're in the middle of working on how to provision the condor_annex instances automatically based on user job demand, so this might all be more auto-magic in the near future. 
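The demand-based provisioning idea can be sketched roughly as follows. To be clear, this is not how condor_annex works internally and it is not our implementation; it is a hypothetical sizing rule, and every name and parameter in it (idle job count, slots per instance, instance cap) is an assumption you would have to supply from your own pool.

```python
def instances_to_request(idle_jobs, slots_per_instance,
                         running_instances, max_instances):
    """Hypothetical sizing rule: how many extra instances to order.

    idle_jobs          -- jobs currently waiting in the queue
    slots_per_instance -- job slots each cloud instance provides
    running_instances  -- cloud instances already up
    max_instances      -- hard cap, to keep the bill bounded
    """
    if slots_per_instance <= 0:
        raise ValueError("slots_per_instance must be positive")
    if idle_jobs <= 0:
        return 0
    # Ceiling division: enough instances to give every idle job a slot.
    needed = -(-idle_jobs // slots_per_instance)
    # Never go past the cap, and never return a negative number.
    room = max(max_instances - running_instances, 0)
    return min(needed, room)


if __name__ == "__main__":
    # Example values are made up for illustration.
    print(instances_to_request(idle_jobs=10, slots_per_instance=4,
                               running_instances=0, max_instances=5))
```

A real setup would also need to decide when to shut idle instances down again, which this sketch leaves out.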

If you are familiar with glideinWMS and want a fully automated solution for ordering up EC2 instances now, you could set up a glideinWMS system to request that glideins be submitted to a cfncluster [2]. cfncluster will automatically spin up instances based on the number of glideins submitted to its local batch queue. This would be quite a bit to set up, but it might work nicely if tuned correctly. It would also be nice if cfncluster supported HTCondor out of the box, but until then, glideins are the only way I can see to work around this.

Marty


[1] https://github.com/htcondor/htcondor/tree/V8_5-condor_annex-branch

[2] https://github.com/awslabs/cfncluster


________________________________________
From: HTCondor-users [htcondor-users-bounces@xxxxxxxxxxx] on behalf of Jens Schmaler [jens.schmaler@xxxxxx]
Sent: Saturday, April 23, 2016 10:01 AM
To: HTCondor-Users Mail List
Subject: [HTCondor-users] Elastically extend local condor pool by EC2 instances

Hi all,

we are successfully running a local *Windows-only* HTCondor pool.
However, there are times when there is need for more computing power
than our pool can provide. We thus want to extend our computing to some
commercial cloud service, preferably Amazon.

While I did find quite a lot of information online on combining HTCondor
with Amazon EC2 instances, I am still rather confused about what is
currently considered best practice that I can expect to work without
major drawbacks. I hope that there might be some experts on this list
who are willing to give advice on this.

My ideal solution would be as follows:

Users continue submitting jobs from their local machine as they are used
to. They do not care about where a job ultimately runs. If it cannot run
on our local cluster, additional EC2 worker nodes are somehow magically
started up and "join" our pool, and the job is executed there without
the user even needing to know about it.

I have read through the documentation, and here are some thoughts that I
got from there:

- Since users do not care about where their job runs, they should
continue creating their vanilla universe jobs and not have to fiddle
with the grid universe themselves.

- Since we are completely Windows-based, the grid universe seems to
allow only another condor pool as grid type anyway, right? This probably
means I cannot directly use HTCondor's EC2 functionality.

- I could run a second cluster on Amazon and use the flocking mechanism
to execute jobs there which cannot run locally. However, I have no idea
how I could make the cloud cluster elastically start up and expand
according to the current needs. Also, what would be the pros and cons of
running another cluster in the cloud vs. adding cloud worker nodes to
our own cluster?

- Generally, I like the idea of the glidein functionality where external
resources appear as part of the local cluster, but I have not seen this
in conjunction with EC2.


As you can see, there are many unclear points on my side, and I would
really appreciate any help in clearing them up. Has anyone done something
similar and can comment on their strategy?

Thanks a lot in advance,
Jens








