
Re: [Condor-users] startd hangs when using job hooks



On 02/09/2010 10:47 AM, Michael Moore wrote:
> On Tue, Feb 09, 2010 at 10:04:16AM -0500, Matthew Farrellee wrote:
>> Michael Moore wrote:
>>> I am trying to implement a set of fetch and prepare hooks. However, when
>>> testing the hooks I experience hangs of condor_startd. When startd hangs,
>>> it stops responding to requests and to Condor shutdown commands. Only a
>>> process-level kill ends the process.
>>>
>>> The host running the hooks is a Windows Vista host running Condor 7.4.1. 
>>> The prepare hook does take some time to run (on the order of minutes). 
>>> However, startd does not always hang during the prepare hook. Sometimes 
>>> startd hangs after the job begins executing, sometimes it doesn't hang 
>>> at all.
>>>
>>> Has anyone else seen similar behavior? Is there a way to work around
>>> the problem? Apparently, there was a similar problem in 7.3.2 and prior
>>> where a very simple fetch hook would cause startd to hang. I haven't
>>> figured out what portion of the hook triggers this behavior; it's very
>>> intermittent.
>>>  
>>> Thanks,
>>> Michael Moore
>>
>> A few issues with hooks on Windows...
>>
>> http://condor-wiki.cs.wisc.edu/index.cgi/search?s=hook+windows
>>
>> Specifically...
>>
>> http://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=422
>> http://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=864
>>
>> Do either of those sound like your problem?
>>
>> I believe one of those is related to using Windows on a machine with 
>> many CPUs -- or at least it is more reproducible there.
>>
>> Best,
>>
>>
>> matt
> 
> Matt,
> 
> Ticket 422 is the previous issue I mentioned above. I tested to make
> sure I wasn't seeing that issue, and it appears to be correctly resolved
> in my testing. The second issue may exist, but I don't get that far:
> startd hangs before the job completes. Ticket 864 is not the
> issue I'm seeing. A good way to describe it is the same symptoms as
> ticket 422, but the issue is not as reproducible and is not caused by
> the simple case provided in that ticket.
> 
> I can confirm I see the issue when I force the number of slots to 1. I
> don't know how reproducible it is in that configuration.
> 
> Thanks for the help!
> 
> Michael

If you can get the issue to reproduce, let us know and we can get a new ticket filed.

Best,


matt