
Re: [HTCondor-users] Problems with power outage etc



Peter,

It sounds like your infrastructure is a little bit fragile. Without
knowing more about it or your jobs, here are my suggestions for
mitigating lost compute time:

* Checkpoint jobs by relinking against the Condor libraries[1] (for
standard universe jobs) or using a third-party wrapper[2] (for vanilla
universe jobs).

* If the power outages are brief, a UPS might work for your small
cluster. If you can't get enough battery capacity to support the entire
cluster, you can put a subset of nodes on UPS and use a custom ClassAd
attribute to indicate which ones have battery. You can then have your
higher-priority jobs prefer the UPS'ed hosts.

* If the power situation is unmanageable, you might consider running on
another resource (e.g. Open Science Grid or Amazon EC2).
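
For the UPS suggestion, a rough sketch of what the configuration could
look like. The attribute name HasUPS is made up for illustration (any
name that doesn't collide with an existing attribute should do), and the
weight of 10 in the rank expression is arbitrary. Treat this as a
starting point to adapt, not a tested recipe for your pool:

```
# --- condor_config.local on each node that is on battery ---
# Define a custom attribute and advertise it in the machine ClassAd.
# HasUPS is a hypothetical name chosen for this example.
HasUPS = True
STARTD_ATTRS = $(STARTD_ATTRS) HasUPS

# --- submit description file for a higher-priority job ---
# Prefer, but do not require, machines that advertise HasUPS.
# =?= makes the comparison evaluate to False, rather than UNDEFINED,
# on machines that do not define the attribute at all.
rank = (TARGET.HasUPS =?= True) * 10
```

After changing the config, run condor_reconfig on those nodes so the
startd picks up the new attribute. Using rank rather than requirements
means the jobs can still run on the non-UPS nodes when the UPS-backed
ones are busy.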

[1] http://research.cs.wisc.edu/htcondor/manual/v7.8/2_4Road_map_Running.html#SECTION00341100000000000000
[2] http://dmtcp.sourceforge.net/condor.html


Hope this helps!
BC


-- 
Ben Cotton
Purdue University