Re: [Condor-users] Notification of Cluster Complete - not process complete



On Thu, 29 Jul 2004, Jaime Frey wrote:

> On Thu, 29 Jul 2004 Robert.Nordlund@xxxxxxxxxxxxxxxx wrote:
>
> > I was under the assumption that if I submitted jobs and my submit machine
> > died, I lost all connection to the running jobs and jobs yet to be
> > scheduled.  Is this the case or am I completely misguided?
>
> If your submit machine crashes, any running jobs are killed on the execute
> machines. When the submit machine restarts, the jobs will be marked as
> idle and Condor will attempt to acquire new machines on which to restart
> them. Since scheduler universe jobs just run on the submit machine,
> they're restarted immediately. Therefore, all your jobs need to be able to
> deal with dying in mid-execution (or just after completion) and afterwards
> being restarted.

One last note: Starting with Condor 6.7.0, jobs can continue to run when
the submit machine crashes. The execute machine lets the job keep running
and waits for the submit machine to reconnect. You can find more
information in section 2.13.4 of the Condor 6.7 Manual.
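
For anyone on an older version, or who wants to be safe either way, the
advice quoted above still applies: a job should tolerate being killed and
restarted. Here is a minimal sketch of one way to do that in Python. It is
not anything Condor provides; the names load_progress, save_progress, run,
and the progress file are all hypothetical. It assumes the work can be
split into items and that processing an item twice is harmless, since a
crash between processing an item and recording it will repeat that item.

```python
import json
import os
import tempfile


def load_progress(path):
    """Return the index of the next item to process, or 0 on a fresh start."""
    try:
        with open(path) as f:
            return json.load(f)["next_item"]
    except (FileNotFoundError, ValueError, KeyError, TypeError):
        # Missing or unreadable progress file: start from the beginning.
        return 0


def save_progress(path, next_item):
    """Record progress atomically so a crash never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"next_item": next_item}, f)
        f.flush()
        os.fsync(f.fileno())
    # Rename over the old file; readers see either the old or the new copy.
    os.replace(tmp, path)


def run(items, progress_path, process):
    """Process items not yet recorded as done; return how many were attempted."""
    start = load_progress(progress_path)
    for i in range(start, len(items)):
        process(items[i])
        save_progress(progress_path, i + 1)
    return len(items) - start
```

If the job is restarted just after completion, load_progress returns the
full item count and run does nothing, which covers the "just after
completion" case as well.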

+----------------------------------+---------------------------------+
|            Jaime Frey            | I stayed up all night playing   |
|        jfrey@xxxxxxxxxxx         | poker with tarot cards. I got a |
|  http://www.cs.wisc.edu/~jfrey/  | full house and four people died.|
+----------------------------------+---------------------------------+