
Re: [Condor-users] How to migrate running jobs with local checkpoints?



Hi Carsten,

> Hi all,
> 
> I have the following problem:
> 
> All 4 slots of a machine are currently used by users (all standard
> universe jobs). However, the hard disk on the system reported that it
> might fail very soon. Thus I would like to migrate the jobs to another
> machine without losing their 20h+ run-times.
> 
> But since local checkpointing is in effect, I don't know how to proceed.
> 
> Is it possible to just issue
> 
> condor_off -startd -peaceful n0066
> 
> and then somehow copy the checkpoint file over to another node? How
> would condor recognize this and use this particular node for the jobs?

I would do this:

- run condor_off as you indicate above so the jobs can
  checkpoint again
- use condor_prio to raise the priority of those jobs so
  they start running again elsewhere
- once you see the jobs running elsewhere, use
  condor_checkpoint to force them to checkpoint to the
  local checkpoint server of their new node

At that point you should be able to take the node offline
without losing any work.
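
As a rough sketch, the three steps above could be scripted like this.
The node name n0066 is from your mail; the job IDs (1234.0, 1234.1),
the new node name (n0067) and the priority value 10 are made-up
placeholders, so substitute your own. Note that condor_prio only
reorders jobs belonging to the same user, so whether it really gets
the jobs matched sooner depends on what else is in the queue. By
default the script only prints the commands (DRY_RUN=1) so you can
review them first.

```shell
#!/bin/sh
# Sketch of the drain / re-prioritise / checkpoint sequence.
# NODE is taken from the thread; JOBS, NEW_NODE and the priority
# value are hypothetical placeholders, not values from a real pool.
NODE="${NODE:-n0066}"
JOBS="${JOBS:-1234.0 1234.1}"
NEW_NODE="${NEW_NODE:-n0067}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Step 1: shut down the startd on the failing node.
drain_node() {
    run condor_off -startd -peaceful "$NODE"
}

# Step 2: raise the job priority of each affected job.
raise_priority() {
    for job in $JOBS; do
        run condor_prio -p 10 "$job"
    done
}

# Step 3: once the jobs run on the new node, ask it to checkpoint.
checkpoint_new_node() {
    run condor_checkpoint "$NEW_NODE"
}
```

Call drain_node, raise_priority and checkpoint_new_node in that
order, checking with condor_q -run between steps 2 and 3 that the
jobs have actually started on another machine. Set DRY_RUN=0 once
the printed commands look right.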

Others may have more elegant solutions, but this is what I
have done in the past.

Scott




> 
> Sorry if this is a dumb question.
> 
> Cheers
> 
> Carsten
> -- 
> Dr. Carsten Aulbert - Max Planck Institute for Gravitational Physics
> Callinstrasse 38, 30167 Hannover, Germany
> Phone/Fax: +49 511 762-17185 / -17193
> http://www.top500.org/system/9234 | http://www.top500.org/connfam/6/list/31