
Re: [Condor-users] Fault Behaviour of Condor



Hello Todd,

Thanks for the explanation.  We too are experiencing a similar problem in our lab.  Is there a work-around?

As somebody already pointed out on the list, if the shadow process dies, Condor immediately attempts to restart the job on a different execute host.

As a stop-gap arrangement (until this behavior is improved), we are planning to use a cluster monitoring tool to detect execute host failures and kill the shadow processes corresponding to the failed execute host.  Does this plan seem OK?

Regards,
Sateesh

Todd Tannenbaum wrote on 08/15/2006 04:01 AM:
At 10:57 AM 8/2/2006, Matt Hope wrote:
  
On 8/2/06, thomas.t.hoppe@xxxxxxxxxxxxxxxxxxx
<thomas.t.hoppe@xxxxxxxxxxxxxxxxxxx> wrote:

    
3.) Shutting down the NIC on the executor (I assume this is the same as
pulling the plug)
Outcome: Condor hangs, and a shadow process stays around indefinitely.
I cannot even remove the job with condor_rm!
Maybe a bug? What can I do?
      
condor_rm -forcex may get rid of it (you may need to kill off the
shadow by hand).  It should eventually time out, though; how long did
you give it?
    

Here is the story:

Condor sends regular "pings" from the submit machine (schedd) to the
execute machine (startd).  Thus the execute machine knows relatively
quickly if a submit machine disappears (because it will not receive a
ping within the anticipated timeframe).  You can configure how often
these pings happen, and how quickly the Condor execute machine will
give up on its current job, by tweaking the job lease parameter in the
submit file.
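
To make the lease idea concrete, here is a minimal sketch of the bookkeeping an execute-side process could do: remember when the last ping arrived and treat the job as abandoned once a full lease period passes without one. This is only an illustration, not Condor's actual code; the class name JobLease and its methods are hypothetical, and the 1200-second figure in the usage note is just an example value. As far as I know, the submit-file command that sets the lease length is job_lease_duration (in seconds), but check the manual for your version.

```python
import time


class JobLease:
    """Hypothetical lease tracker: a ping renews it, silence lets it expire."""

    def __init__(self, duration_seconds, now=None):
        self.duration = duration_seconds
        # time.monotonic() is immune to wall-clock adjustments.
        self.last_renewed = time.monotonic() if now is None else now

    def renew(self, now=None):
        """Record that a ping from the submit machine just arrived."""
        self.last_renewed = time.monotonic() if now is None else now

    def expired(self, now=None):
        """True once more than `duration` seconds have passed without a ping."""
        now = time.monotonic() if now is None else now
        return now - self.last_renewed > self.duration
```

With a lease of 1200 seconds, an execute machine following this logic would keep the job for up to 20 minutes after the last ping and give it up after that. The optional `now` argument exists only so the logic can be exercised without waiting.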

However, going the other way is a different story.  There is currently
no way to configure how often "pings" happen from the execute machine
back to the submit machine.  Thus, there is no way to configure how
long it takes before the submit machine (the condor_shadow) notices
that an execute machine has fallen off the network.  Have no fear,
however, as Condor most definitely *will* notice it eventually, but
you may need to be patient.  The socket created by the condor_shadow
uses TCP's KEEPALIVE option.  However, the TCP standard (RFC 1122)
requires the default keepalive idle time to be no less than two hours,
so the first probe typically goes out only after two hours of silence.
So in the worst case, the shadow may take a little over two hours to
notice that the execute machine has fallen off the network without a
trace.  This is an issue we'd like to improve.
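
For readers unfamiliar with TCP keepalive, the sketch below shows the general mechanism on an ordinary socket: enabling SO_KEEPALIVE and, where the operating system exposes per-socket timers, shortening them. It is not Condor code and does not correspond to any Condor configuration knob; the function names and the example timer values are assumptions chosen for illustration. The TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT options are the Linux names and may be missing on other platforms, so the code applies each one only if Python's socket module defines it.

```python
import socket


def enable_fast_keepalive(sock, idle=60, interval=10, probes=5):
    """Enable TCP keepalive and shorten its timers where the platform allows.

    Returns a dict of the timer options that were actually applied.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    applied = {}
    for name, value in (
        ("TCP_KEEPIDLE", idle),       # seconds of silence before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between unanswered probes
        ("TCP_KEEPCNT", probes),      # unanswered probes before giving up
    ):
        option = getattr(socket, name, None)
        if option is not None:  # skip timers this platform does not expose
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
            applied[name] = value
    return applied


def worst_case_detection_seconds(idle, interval, probes):
    """Rough upper bound on how long a silently dead peer goes unnoticed."""
    return idle + interval * probes
```

With common Linux defaults (7200 s idle, 75 s between probes, 9 probes), worst_case_detection_seconds gives 7875 seconds, which is where the "about two hours" figure comes from. With the example values above (60, 10, 5) the same estimate drops to 110 seconds.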

Hope this helps clarify things,
Todd




-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Todd Tannenbaum                       University of Wisconsin-Madison
Condor Project Research               Department of Computer Sciences
tannenba@xxxxxxxxxxx                  1210 W. Dayton St. Rm #4257
http://www.cs.wisc.edu/~tannenba      Madison, WI 53706-1685
Phone: (608) 263-7132  FAX: (608) 262-9777
