Re: [HTCondor-users] Backup guidance



Thank you everyone for your thoughts! There is a lot of good wisdom in the replies. Our jobs will be vanilla universe, but partitionable. Reconstructing the running jobs from a backup could be done by inspecting the job files, or using the information Jaime gave. We will probably also want to use the high availability options. This gives us some options and things to think about.

Thanks again,
  Nathan

------------------------------------------------
Nathan Sharp
Phoenix Integration Inc
www.phoenix-int.com





----- On Jul 19, 2018, at 10:49 AM, Jaime Frey <jfrey@xxxxxxxxxxx> wrote:
On Jul 18, 2018, at 7:06 AM, Nathan Sharp <nsharp@xxxxxxxxxxxxxxx> wrote:

Hello all,
 We are investigating HTCondor for some potential production uses and so far have been having great success! A huge thanks to all who have helped create the online documentation on HTCondor.

 So far I have not had much luck finding information about how to properly back up an HTCondor system to protect against system failure and data loss. Is there information on this available somewhere? Can the execute, spool, log, and lock folders be backed up live, or do we need to shut down the daemons to get a clean state? Are there other pieces that need to be backed up with the files?

The important data is in the spool directory, and the two daemons that keep persistent state there are the negotiator (accounting data of past resource usage by users) and the schedd (job queue, history of past jobs, and sometimes job data files). 
Shutting down the daemons during a backup is always going to be the safest option.
The accounting, job queue, and history files are written as transactional append-only logs to allow for graceful recovery from a failure. Backing up these files live should work pretty well. Depending on the order of copying the job queue and history files, you may end up with jobs that appear in the history twice or not at all after a restore.
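
To make the copy-order point concrete, here is a minimal Python sketch of a live backup that copies chosen spool files in an explicit order. It is an illustration under assumptions, not an HTCondor tool: the function name backup_spool, the timestamped directory layout, and the file names in DEFAULT_ORDER are all hypothetical choices, so check the file names against your own pool's spool directory before relying on them.

```python
import shutil
import time
from pathlib import Path

# Assumed names of the job queue, history, and accounting files inside
# the spool directory. Verify these against your own installation.
DEFAULT_ORDER = ["job_queue.log", "history", "Accountantnew.log"]


def backup_spool(spool_dir, backup_root, order=DEFAULT_ORDER):
    """Copy selected spool files, in the given order, into a new
    timestamped directory under backup_root.

    Returns the destination directory and the names actually copied.
    Files listed in `order` that do not exist are skipped.
    """
    spool = Path(spool_dir)
    dest = Path(backup_root) / time.strftime("spool-%Y%m%d-%H%M%S")
    dest.mkdir(parents=True, exist_ok=False)

    copied = []
    for name in order:
        src = spool / name
        if src.is_file():
            # copy2 also preserves timestamps, which helps when
            # comparing a backup against the live files later.
            shutil.copy2(src, dest / name)
            copied.append(name)
    return dest, copied
```

With this order, a job that finishes between the two copies is still in the copied job queue and already in the copied history, so it can show up twice after a restore. Reversing the order means such a job can be missing from both. The spool path itself can be looked up with condor_config_val SPOOL.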

With any backup scheme, things will get funny if the system runs for a while between the backup and the restore. Jobs that previously completed and left the queue can come back to life. Jobs submitted after the backup no longer exist and new jobs with the same job ids will appear. This can quickly confuse job-related state that lives under the user's control (job data files, job event log, etc).

Thanks and regards,
Jaime Frey
UW-Madison HTCondor Project


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/