Re: [Condor-users] Reboot the submit machine and not restart jobs?



Sometimes a DAGMan job does not restart properly after the reboot:
the DAG stays frozen in the state it had at reboot and no new jobs are submitted.
I have not been able to reproduce the problem reliably, but it happens quite often.

Cheers,
Szabolcs




*********** REPLY SEPARATOR  ***********

On 1/12/2006 at 8:42 AM Matt Hope wrote:

>On 1/11/06, Finch, Ralph <rfinch@xxxxxxxxxxxx> wrote:
>> condor -version
>> $CondorVersion: 6.7.13 Nov  7 2005 $
>> $CondorPlatform: INTEL-WINNT50 $
>
>given this
>
>> Occasionally after submitting jobs from my machine, I need to reboot my
>> machine (it's windows, after all).  However, this means all my jobs must
>> restart (since this is windows, it is only the vanilla universe and they
>> are not checkpointed).  Since the jobs take a few hours to complete, I
>> was wondering if it's possible for the jobs running in the pool on other
>> machines to not restart, but simply reconnect with new condor_shadows
>> when my submit machine comes back after reboot.
>
>The answer is yes, but only if the startd's are also 6.7 series and
>are configured to enable leasing.
>
>http://www.cs.wisc.edu/condor/manual/v6.7/2_15Special_Environment.html#SECTION003154000000000000000
>
>One proviso for Windows: if you shut down normally, the lease will
>not take effect; the shutdown triggers an eviction instead. To make it
>work across a shutdown you need to hard-kill your Condor subsystem
>(pskill is going to be your friend here), as if your machine had reset
>without warning.
>
>I dislike this behaviour intensely (especially given Windows' likely
>use of more vanilla, non-checkpointing jobs), since if you have some
>jobs which *can* checkpoint and others which can't then you're SOL.
>
>When my farm goes to 6.8 I can see this being the most common "why
>can't I do this?" request I get.
>
>Matt
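
For reference, here is a minimal sketch of a submit description file that requests a job lease. It assumes the lease is requested with the job_lease_duration submit command described in the manual section linked above. The executable name and the one-hour value are placeholders, not taken from this thread, so check the manual for your Condor version before relying on it.

```
# Sketch only: long_job.exe and the 3600-second lease are assumed values.
universe           = vanilla
executable         = long_job.exe
# Ask the execute machine to keep the job running for up to this many
# seconds while it cannot reach the submit machine.
job_lease_duration = 3600
queue
```

The lease should be longer than the time the submit machine is expected to stay down. As noted above, on Windows it reportedly only helps if the Condor daemons are killed abruptly rather than shut down cleanly.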