
[Condor-users] How to migrate running jobs with local checkpoints?



Hi all,

I have the following problem:

All 4 slots of a machine are currently in use by users (all standard
universe jobs). However, the hard disk on that system has reported that
it might fail very soon. I would therefore like to migrate the jobs to
another machine without losing their 20h+ run-times.

But since local checkpointing is in effect, I don't know how to proceed.

Is it possible to just issue

condor_off -startd -peaceful n0066

and then somehow copy the checkpoint files over to another node? How
would Condor recognize this and use that particular node for the jobs?

Sorry if this is a dumb question.

Cheers

Carsten
-- 
Dr. Carsten Aulbert - Max Planck Institute for Gravitational Physics
Callinstrasse 38, 30167 Hannover, Germany
Phone/Fax: +49 511 762-17185 / -17193
http://www.top500.org/system/9234 | http://www.top500.org/connfam/6/list/31