Re: [Condor-users] startd hangs when using job hooks



On Tue, Feb 09, 2010 at 10:04:16AM -0500, Matthew Farrellee wrote:
> Michael Moore wrote:
> > I am trying to implement a set of fetch and prepare hooks. However, when 
> > testing the hooks I experience hangs of condor_startd. When startd hangs 
> > it quits responding to requests and condor shutdowns. Only a process 
> > level kill ends the process.
> > 
> > The host running the hooks is a Windows Vista host running Condor 7.4.1. 
> > The prepare hook does take some time to run (on the order of minutes). 
> > However, startd does not always hang during the prepare hook. Sometimes 
> > startd hangs after the job begins executing, sometimes it doesn't hang 
> > at all.
> > 
> > Has anyone else seen similar behavior? Was there a way to work around 
> > the problem? Apparently, there was a similar problem in 7.3.2 and prior 
> > where a very simple fetch hook would cause startd to hang. I haven't 
> > figured out what portion of the hook triggers this behavior, it's very 
> > intermittent.
> >  
> > Thanks,
> > Michael Moore
> 
> A few issues with hooks on Windows...
> 
> http://condor-wiki.cs.wisc.edu/index.cgi/search?s=hook+windows
> 
> Specifically...
> 
> http://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=422
> http://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=864
> 
> Do either of those sound like your problem?
> 
> I believe one of those is related to using Windows on a machine with 
> many CPUs -- or at least it is more reproducible there.
> 
> Best,
> 
> 
> matt

Matt,

Ticket 422 is the previous issue I mentioned above. I tested to make
sure I wasn't seeing that issue, and it appears to be correctly resolved
in my testing. The second issue may exist, but I don't get that far:
startd hangs before the job completes. So ticket 864 is not the issue
I'm seeing. A good way to describe it is that the symptoms are the same
as ticket 422, but the issue is less reproducible and is not triggered
by the simple case provided in that ticket.

I can confirm that I see the issue when I force the number of slots to
1, but I don't know how reproducible it is in that configuration.

Thanks for the help!

Michael