Re: [Condor-users] Intermittent Condor startd crashes

Hey Craig,

> Thanks for reporting this, Ian. I was beginning to go crazy trying to
> track down a similar issue.

You may not be going crazy. There's some sort of race condition in the
pipe cleanup on Windows when you're using hooks. That's for certain.

> Interestingly, I've been having problems with Condor 7.3.1 on an 8
> slot OS X 10.5 machine. I haven't been able to figure out why this one
> machine would have slots fail after running a job, while all the
> others (2 slot) using the same binaries and configuration ran fine.
> Condor also fails to exec daemons when using condor_restart or
> condor_off followed by condor_on when using 8 slots.
> Reducing the number of slots to 4 makes everything run fine.

I can still make my <8 slot machines fail with hooks. It just takes a
little longer to hit the problem. And my failure is definitely only in
the Windows code that cleans up pipes. But using 8 slots or more gets me
to the problem very quickly. The error is in daemon_core.cpp, specifically:

#if defined(WIN32)
// If Close_Pipe is called on a Windows WritePipeEnd and there is
// an outstanding overlapped write operation, we can't immediately
// close the pipe. Instead, we call this function in a separate
// thread and close the pipe once the operation is complete
unsigned __stdcall pipe_close_thread(void *arg)
{
        WritePipeEnd* wpe = (WritePipeEnd*)arg;

        dprintf(D_DAEMONCORE, "finally closing pipe %p\n", wpe);
        delete wpe;

        return 0;
}
#endif

That 'delete wpe' call is the one crashing the startd. Since this code is
only compiled into Windows binaries, it might not be related to your
crash on OS X.
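
To illustrate what I think that comment is describing, here's a minimal
portable C++14 sketch of the deferred-close pattern. This is not the Condor
code: FakePipeEnd, close_when_idle and demo_deferred_close are names I made
up, and a std::future stands in for the outstanding overlapped write. The
point is only that the closing thread is the sole owner of the object, so
the delete can happen once and only after the pending write finishes. I
don't know that ownership is what actually goes wrong in the startd; it's
just the property I'd check first.

```cpp
#include <atomic>
#include <cassert>
#include <future>
#include <memory>
#include <thread>
#include <utility>

// Hypothetical stand-in for WritePipeEnd; not the real Condor class.
class FakePipeEnd {
public:
    FakePipeEnd(std::future<void> write_done, std::atomic<bool>& destroyed)
        : write_done_(std::move(write_done)), destroyed_(destroyed) {}

    // Record destruction so the demo can observe when the object is freed.
    ~FakePipeEnd() { destroyed_.store(true); }

    // Block until the outstanding (simulated) write has completed.
    void wait_for_pending_write() { write_done_.wait(); }

private:
    std::future<void> write_done_;
    std::atomic<bool>& destroyed_;
};

// Hand the pipe end to a closer thread. The thread owns the object
// exclusively, waits for the pending write, then destroys it.
std::thread close_when_idle(std::unique_ptr<FakePipeEnd> pipe) {
    return std::thread([p = std::move(pipe)]() mutable {
        p->wait_for_pending_write();
        p.reset();  // the only place the object is ever deleted
    });
}

// Returns true if the object stayed alive while the write was pending
// and was destroyed after the write completed.
bool demo_deferred_close() {
    std::atomic<bool> destroyed{false};
    std::promise<void> write_finished;  // fulfilled by the "I/O side"

    auto pipe = std::make_unique<FakePipeEnd>(write_finished.get_future(),
                                              destroyed);
    std::thread closer = close_when_idle(std::move(pipe));
    assert(!pipe);  // caller gave up ownership; it cannot delete again

    // The closer is blocked on the pending write, so the object is alive.
    bool alive_while_pending = !destroyed.load();

    write_finished.set_value();  // simulate the write completing
    closer.join();

    return alive_while_pending && destroyed.load();
}
```

Calling demo_deferred_close() from a test main should return true. If the
caller kept a raw pointer and also deleted or touched the object after
handing it off, you'd get the kind of crash I'm seeing.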

- Ian

