
Re: [Condor-users] Notification of Cluster Complete - not process complete






I noticed upon reboot that all the jobs I had submitted were waiting,
because the original execute machines still reported being in the Claimed
state. I found in the documentation that two settings, one on the schedd
and one on the startd, govern this state change: ALIVE_INTERVAL, which
defaults to 5 minutes, and MAX_CLAIM_ALIVES_MISSED, which defaults to 6.
I verified in my configuration that the startds take 30 minutes
(6 * 5 minutes) to switch to the Unclaimed state.
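
The arithmetic above can be sketched as follows. This is only an
illustration, assuming the defaults quoted in this message (ALIVE_INTERVAL
of 300 seconds, MAX_CLAIM_ALIVES_MISSED of 6); the function name is
hypothetical and not part of Condor.

```python
def claim_timeout_seconds(alive_interval_s: int = 300,
                          max_alives_missed: int = 6) -> int:
    """Time a startd waits without keepalives before giving up the claim.

    Assumes the timeout is simply the keepalive interval multiplied by
    the number of missed keepalives tolerated, as described above.
    """
    return alive_interval_s * max_alives_missed


timeout = claim_timeout_seconds()
print(f"{timeout} s = {timeout // 60} min")
```

Lowering either value would presumably shorten the wait before the execute
machines return to Unclaimed, at the cost of dropping claims sooner during
short network interruptions.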

Thanks,
Bob


Jaime Frey <jfrey@xxxxxxxxxxx>
Sent by: condor-users-bounces@cs.wisc.edu
07/29/2004 04:51 PM
Please respond to Condor-Users Mail List

To:       Condor-Users Mail List <condor-users@xxxxxxxxxxx>
cc:
Subject:  Re: [Condor-users] Notification of Cluster Complete - not process complete




On Thu, 29 Jul 2004 Robert.Nordlund@xxxxxxxxxxxxxxxx wrote:

> I was under the assumption that if I submitted jobs and my submit machine
> died, I lost all connection to the running jobs and jobs yet to be
> scheduled.  Is this the case or am I completely misguided?

If your submit machine crashes, any running jobs are killed on the execute
machines. When the submit machine restarts, the jobs will be marked as
idle and Condor will attempt to acquire new machines on which to restart
them. Since scheduler universe jobs just run on the submit machine,
they're restarted immediately. Therefore, all your jobs need to be able to
deal with dying in mid-execution (or just after completion) and afterwards
being restarted.
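
One way a job could tolerate being killed and restarted is to record its
progress as it goes. The sketch below is not a Condor API; it assumes the
job's work can be split into ordered items and that the progress file
(a hypothetical name passed as state_path) lives somewhere that survives
the restart.

```python
import json
import os


def load_done(state_path):
    """Return how many items an earlier run finished (0 if none)."""
    try:
        with open(state_path) as f:
            return json.load(f)["done"]
    except FileNotFoundError:
        return 0


def save_done(state_path, done):
    """Record progress via write-then-rename, so a crash in the middle
    of the write does not leave a half-written progress file."""
    tmp_path = state_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"done": done}, f)
    os.replace(tmp_path, state_path)


def run_job(state_path, items, process):
    """Process items in order, skipping those a previous run completed.

    Returns the number of items handled by this run.
    """
    start = load_done(state_path)
    for index in range(start, len(items)):
        process(items[index])
        save_done(state_path, index + 1)
    return len(items) - start
```

If the job dies after process() but before save_done(), that one item is
processed again on restart, so process() should be safe to repeat. A job
restarted just after completion finds every item recorded and does
nothing.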

+----------------------------------+---------------------------------+
|            Jaime Frey            | I stayed up all night playing   |
|        jfrey@xxxxxxxxxxx         | poker with tarot cards. I got a |
|  http://www.cs.wisc.edu/~jfrey/  | full house and four people died.|
+----------------------------------+---------------------------------+




