
[HTCondor-users] Elastically extend local condor pool by EC2 instances



Hi all,

We are successfully running a local *Windows-only* HTCondor pool.
However, there are times when we need more computing power than our
pool can provide. We therefore want to extend our pool to a commercial
cloud service, preferably Amazon.

While I did find quite a bit of information online on combining
HTCondor with Amazon EC2 instances, I am still rather confused about
what is currently considered best practice and can be expected to work
without major drawbacks. I hope there are some experts on this list who
are willing to give advice on this.

My ideal solution would be as follows:

Users continue submitting jobs from their local machine as they are used
to. They do not care about where a job ultimately runs. If it cannot run
on our local cluster, additional EC2 worker nodes are somehow magically
started up and "join" our pool, and the job is executed there without
the user even needing to know about it.
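
To make the "magically started up" part concrete, here is a rough
Python sketch of the scaling decision I have in mind. It is only an
illustration under my own assumptions: the function name, its
parameters, and the numbers are hypothetical, and it neither queries
HTCondor nor starts anything on EC2. It only computes how many cloud
workers would be needed for the jobs the local pool cannot absorb.

```python
import math


def ec2_workers_to_start(idle_jobs, free_local_slots,
                         slots_per_instance, max_instances):
    """Return how many cloud workers to launch for overflow jobs.

    All inputs are assumed to come from elsewhere (e.g. a pool query);
    this function is purely arithmetic.
    """
    if slots_per_instance <= 0:
        raise ValueError("slots_per_instance must be positive")
    # Jobs that cannot be placed on currently free local slots.
    overflow = max(0, idle_jobs - free_local_slots)
    # Round up so a partially filled instance is still started.
    needed = math.ceil(overflow / slots_per_instance)
    # Never exceed the (hypothetical) budget cap on instances.
    return min(needed, max_instances)


if __name__ == "__main__":
    # 10 idle jobs, 4 free local slots, 4 slots per instance, cap of 5.
    print(ec2_workers_to_start(10, 4, 4, 5))
```

The cap on the number of instances is there only because I would want
an upper bound on cloud costs.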

I have read through the documentation, and here are some thoughts that I
got from there:

- Since users do not care about where their job runs, they should
continue creating their vanilla universe jobs and not have to fiddle
with the grid universe themselves.

- Since we are completely Windows-based, the grid universe seems to
allow only another HTCondor pool as the grid type anyway, right? This
probably means I cannot directly use HTCondor's EC2 functionality.

- I could run a second cluster on Amazon and use the flocking mechanism
to run jobs there that cannot run locally. However, I have no idea how
I could make the cloud cluster start up and grow elastically according
to current demand. Also, what would be the pros and cons of running
another cluster in the cloud vs. adding cloud worker nodes to our own
cluster?

- Generally, I like the idea of the glidein functionality, where
external resources appear as part of the local cluster, but I have not
seen this used in conjunction with EC2.


As you can see, there are many unclear points on my side, and I would
really appreciate any help in clearing them up. Has anyone done
something similar and can comment on their strategy?

Thanks a lot in advance,
Jens