Re: [HTCondor-users] Backup guidance



On Jul 18, 2018, at 7:06 AM, Nathan Sharp <nsharp@xxxxxxxxxxxxxxx> wrote:

Hello all,
 We are investigating HTCondor for some potential production uses and so far have been having great success! A huge thanks to all who have helped create the online documentation on HTCondor.

 So far I have not had much luck finding information about how to properly back up an HTCondor system to protect against system failure and data loss. Is there information on this available somewhere? Can the execute, spool, log, and lock folders be backed up live, or do we need to shut down the daemons to get a clean state? Are there other pieces that need to be backed up with the files?

The important data is in the spool directory, and the two daemons that keep persistent state there are the negotiator (accounting data of past resource usage by users) and the schedd (job queue, history of past jobs, and sometimes job data files). 
Shutting down the daemons during a backup is always going to be the safest option.
The accounting, job queue, and history files are written as transactional append-only logs to allow for graceful recovery from a failure. Backing up these files live should work pretty well. Depending on the order of copying the job queue and history files, you may end up with jobs that appear in the history twice or not at all after a restore.
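As a rough illustration of the ordering point, here is a minimal Python sketch of a live copy that takes the files in a fixed order. The file names (job_queue.log, history, Accountantnew.log) are assumptions about a default spool layout and may differ on your system, and the function name is made up for this example. Copying the job queue before the history means a job that finishes between the two copies would show up in both after a restore, rather than in neither; reverse the order if you prefer the opposite trade-off.

```python
import shutil
from pathlib import Path

# Assumed file names under the spool directory; verify against your own
# configuration before relying on them.
COPY_ORDER = ["job_queue.log", "history", "Accountantnew.log"]


def backup_spool_files(spool_dir, backup_dir, order=COPY_ORDER):
    """Copy the named files from spool_dir to backup_dir, in the given order.

    Files that do not exist are skipped. Returns the names that were copied.
    """
    spool = Path(spool_dir)
    dest = Path(backup_dir)
    dest.mkdir(parents=True, exist_ok=True)

    copied = []
    for name in order:
        src = spool / name
        if src.is_file():
            # copy2 also preserves timestamps, which helps when judging
            # how old a backup is at restore time.
            shutil.copy2(src, dest / name)
            copied.append(name)
    return copied
```

You would call it with the spool path reported by `condor_config_val SPOOL` and a destination of your choosing. This only covers the log files named above, not job data files that may also live in the spool.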

With any backup scheme, things will get funny if the system runs for a while between the backup and the restore. Jobs that previously completed and left the queue can come back to life. Jobs submitted after the backup no longer exist and new jobs with the same job IDs will appear. This can quickly confuse job-related state that lives under the user's control (job data files, job event log, etc).

Thanks and regards,
Jaime Frey
UW-Madison HTCondor Project