
Re: [Condor-users] startd hangs when using job hooks



Ian,

In my case the issue mysteriously resolved itself when I changed the way 
STDIN was read in the fetch script. I had been doing, in Perl, a join() 
on STDIN. When I switched to a while(<STDIN>) loop that appends the 
input to a temporary string, the issue went away. Unfortunately, I was 
not able to create a simple case that isolated the cause. When I tested 
just a join() versus a while() in a fetch script, neither version 
exhibited the startd hang.
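
For anyone comparing the two reading styles, here is a minimal sketch of the difference. It is written in Python purely for illustration (the script in this thread was Perl), the function names are hypothetical, and the sample text is a made-up stand-in for whatever the startd writes to the hook's stdin. It only shows that both styles yield the same string; it does not reproduce the hang.

```python
import io


def read_all_at_once(stream):
    """Rough analogue of a Perl join() over <STDIN>: one call reads everything."""
    return stream.read()


def read_line_by_line(stream):
    """Rough analogue of while (<STDIN>): append each line to a temporary string."""
    buf = ""
    for line in stream:
        buf += line
    return buf


if __name__ == "__main__":
    # Made-up input; a real hook would pass sys.stdin instead of a StringIO.
    sample = 'Name = "slot1@example"\nState = "Unclaimed"\n'
    same = read_all_at_once(io.StringIO(sample)) == read_line_by_line(io.StringIO(sample))
    print("identical result:", same)
```

Both functions block until the writer closes its end of the pipe, so any difference in behaviour would have to come from timing or buffering rather than from the data returned.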

As an additional note, I was seeing the exact same errors as in the 
previous bug, where startd hung with just the simple 'exit 0' fetch hook.

Michael

On Tue, Mar 02, 2010 at 05:05:57PM -0500, Ian Chesal wrote:
> Here's another tarball with log files. This time with ALL_DEBUG =
> D_FULLDEBUG turned on.
> 
> It's quite strange. The way my hooks are working right now, every hook
> returned a job to Condor to run. The job being just a dummy sleep-forever
> loop for testing. Condor didn't get the jobs even though the hooks said they
> returned them. And only 3 of the 8 slots on this box (I switched to a faster
> box) fired the hook fetchwork script and wrote out log files.
> 
> And after they ran once, Condor stopped running them. The machine doesn't
> show up in condor_status output and:
> 
> D:\arc\condor>condor_status -debug -direct localhost
> 03/02 14:03:52 Locale: English_United States.1252
> 03/02 14:03:52 Initialized the following authorization table:
> 03/02 14:03:52 Authorizations yet to be resolved:
> 03/02 14:03:52 allow NEGOTIATOR:  */137.57.206.237 */sj-arcdev.altera.com
> 03/02 14:03:52 allow ADMINISTRATOR:  */137.57.202.96 */sj-arcdev.altera.com
>   */sqal19-test.altera.com */137.57.206.237 */sj-sw-web5.altera.com
>   */sw-cycleserver.altera.com */sj-bs5450i-380.altera.com */137.57.206.83
>   */137.57.206.69 */137.57.206.69
> 03/02 14:03:52 allow OWNER:  */137.57.202.96 */sj-arcdev.altera.com
>   */sqal19-test.altera.com */137.57.206.237 */sj-sw-web5.altera.com
>   */sw-cycleserver.altera.com */sj-bs5450i-380.altera.com
>   */sj-bs5450i-380.altera.com */137.57.206.83 */137.57.206.83
>   */137.57.206.69 */137.57.206.69
> 03/02 14:03:52 allow CONFIG:  */137.57.202.96 */sj-arcdev.altera.com
>   */sqal19-test.altera.com */137.57.206.237 */sj-sw-web5.altera.com
>   */sw-cycleserver.altera.com */sj-bs5450i-380.altera.com
>   */sj-bs5450i-380.altera.com */137.57.206.83 */137.57.206.83
>   */137.57.206.69 */137.57.206.69
> 03/02 14:04:52 condor_read(): timeout reading 5 bytes from collector at
>   <137.57.206.83:3868>.
> 03/02 14:04:52 IO: Failed to read packet header
> 
> Ouch.
> 
> I didn't update my ALLOW_* statements -- maybe I should revisit permissions
> for the 7.4 release?
> 
> - Ian
> 
> On Tue, Mar 2, 2010 at 3:33 PM, Ian Chesal <ian.chesal@xxxxxxxxx> wrote:
> 
> > On Fri, Feb 12, 2010 at 1:27 PM, Matthew Farrellee <matt@xxxxxxxxxx> wrote:
> >
> >> If you can get the issue to reproduce let us know and we can get a new
> >> ticket filed.
> >>
> >
> > I'm trying this now. Condor 7.4.1 on Windows XP SP1 64-bit.
> >
> > It's hung up for sure. I'm attaching the log files from the machine, no
> > debugging on right now but I can turn it on and try again if you like.
> > Config files in use are also included. The entry point is
> > config/condor_config.
> >
> > I can see the process tree as:
> >
> > condor_master.exe
> >    condor_startd.exe
> >       cmd.exe
> >          perl.exe
> >       cmd.exe
> >          perl.exe
> >       cmd.exe
> >          perl.exe
> >
> > Which is weird. I expect four cmd/perl trees because it's a 4 slot machine
> > but I only ever get three.
> >
> > My hook scripts write log data to ARCFetchWorkLog.N and one of them started
> > to write some output to the log and then just stopped.
> >
> > Daemon startup was at 12:03 and I grabbed those logs at 12:32 -- as you can
> > see it hasn't done much in that time.
> >
> > I made no changes to my config files going from 7.2.2 to 7.4.1; it didn't
> > appear necessary and I don't get any errors on daemon startup.
> >
> > - Ian
> >

