
Re: [HTCondor-users] Problems with power outage etc



Hi

Yes, you are correct that we are a bit fragile. We are changing offices at the moment, which will hopefully help with the power situation.

Checkpointing sounds like a very nice idea. I will have a look at dmtcp.

A quick solution to our problem would be to set up the shadow so that it stops the job rather than restarting it. We can use restart files from our programs to restart the job manually. Is that possible?

Another question: is it possible to set up Condor to run special scripts when stopping or restarting jobs?

Peter

-----Original Message-----
From: htcondor-users-bounces@xxxxxxxxxxx [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Cotton, Benjamin J
Sent: 6 June 2013 15:06
To: htcondor-users@xxxxxxxxxxx
Subject: Re: [HTCondor-users] Problems with power outage etc

Peter,

It sounds like your infrastructure is a little bit fragile. Without knowing more about it or your jobs, here are my suggestions for mitigating lost compute time:

* Checkpoint jobs by relinking against the Condor libraries[1] (for standard universe jobs) or using a third-party wrapper[2] (for vanilla universe jobs).

* If the power outages are brief, a UPS might work for your small cluster. If you can't get enough battery capacity to support the entire cluster, you can put a subset of nodes on UPS and use a custom ClassAd attribute to indicate which ones have battery backup. You can then have your higher-priority jobs prefer the UPS'ed hosts.

* If the power situation is unmanageable, you might consider running on another resource (e.g. Open Science Grid, Amazon EC2).

[1]http://research.cs.wisc.edu/htcondor/manual/v7.8/2_4Road_map_Running.html#SECTION00341100000000000000
[2]http://dmtcp.sourceforge.net/condor.html
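
For the UPS suggestion above, a rough sketch of what the configuration might look like follows. The attribute name HAS_UPS is made up for illustration (it is not a built-in HTCondor attribute), and the exact settings may need adjusting for your version and setup, so treat this as a starting point rather than a tested recipe.

```
# --- condor_config.local on each node that is on UPS power ---
# HAS_UPS is a hypothetical custom attribute name chosen for this example.
HAS_UPS = True
# Publish the custom attribute in the machine ClassAd.
STARTD_ATTRS = $(STARTD_ATTRS) HAS_UPS

# --- submit description file for a higher-priority job ---
# Prefer (but do not require) machines that advertise HAS_UPS.
# =?= compares without yielding UNDEFINED on machines lacking the attribute.
rank = ifThenElse(TARGET.HAS_UPS =?= True, 1, 0)
```

Using rank rather than requirements means the job can still run on a non-UPS node when no UPS'ed node is free; if a job must only run on battery-backed hosts, the same test could go into requirements instead.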


Hope this helps!
BC


--
Ben Cotton
Purdue University