
Re: [Condor-users] Intermittent Condor startd crashes




Hi Folks -

Thanks to all, esp to Ian, for gathering/sharing the below info. The Condor user community is the best around :). So there is definitely a bug lurking in how Condor is dealing w/ pipes on MS Windows machines. We have been gathering up info/thoughts related to this issue (ticket #422 on condor-wiki.cs.wisc.edu).

I will definitely be looking into this bug by early next week, likely Tuesday, unless someone else beats me to it.

In any event, I will send out an update to condor-users by end of the day on Tuesday.

regards,
Todd

Ian Chesal wrote:
Hey Craig,

> Thanks for reporting this, Ian. I was beginning to go crazy trying to
> track down a similar issue.

You may not be going crazy. There's some sort of race condition in pipe
cleanup on Windows when you're using hooks. That's for certain.

> Interestingly, I've been having problems with Condor 7.3.1 on an 8
> slot OS X 10.5 machine. I haven't been able to figure out why this one
> machine would have slots fail after running a job, while all the
> others (2 slot) using the same binaries and configuration ran fine.
>
> Condor also fails to exec daemons when using condor_restart or
> condor_off followed by condor_on when using 8 slots.
>
> Reducing the number of slots to 4 makes everything run fine.

I can still make my <8 slot machines fail with hooks. It just takes a
little longer to hit the problem. And my failure is definitely only in
the Windows code that cleans up pipes. But using 8 slots or more gets me
to the problem very quickly. The error is in daemon_core.cpp, specifically:

#if defined(WIN32)
// If Close_Pipe is called on a Windows WritePipeEnd and there is
// an outstanding overlapped write operation, we can't immediately
// close the pipe. Instead, we call this function in a separate
// thread and close the pipe once the operation is complete
unsigned __stdcall pipe_close_thread(void *arg)
{
        WritePipeEnd* wpe = (WritePipeEnd*)arg;
        wpe->complete_async_write(false);

        dprintf(D_DAEMONCORE, "finally closing pipe %p\n", wpe);
        delete wpe;

        return 0;
}
#endif

That 'delete wpe' call is the one crashing the startd. Since it's only
called if it's a Windows binary this might not be related to your crash.

- Ian
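
For anyone following along, here is a minimal, portable sketch of one generic way to make a "close it later in a helper thread" pattern safe against a double free or use-after-free. It is NOT Condor's code and NOT a diagnosis of the crash above: the root cause has not been identified in this thread. FakePipeEnd, close_in_background and run_demo are hypothetical names invented for illustration, and std::thread stands in for the Windows thread API. The idea is simply that the helper thread holds its own shared_ptr copy, so the object is destroyed exactly once, whichever owner lets go last.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <thread>
#include <utility>

// Hypothetical stand-in for a pipe write end; this is not Condor's WritePipeEnd.
struct FakePipeEnd {
    static std::atomic<int> live;       // objects currently alive
    static std::atomic<int> destroyed;  // destructor calls so far
    FakePipeEnd() { ++live; }
    ~FakePipeEnd() { --live; ++destroyed; }
    // In real code this would block until the pending write finishes.
    void complete_async_write() {}
};
std::atomic<int> FakePipeEnd::live{0};
std::atomic<int> FakePipeEnd::destroyed{0};

// Start a helper thread that finishes the write and then drops its
// reference. No raw 'delete' is involved, so no owner can free the
// object while another owner still holds it.
void close_in_background(std::shared_ptr<FakePipeEnd> pipe, std::thread& worker) {
    worker = std::thread([p = std::move(pipe)]() mutable {
        p->complete_async_write();
        p.reset();  // helper thread releases its reference
    });
}

// Returns the total number of destructor calls observed so far.
int run_demo() {
    std::thread worker;
    {
        auto pipe = std::make_shared<FakePipeEnd>();
        close_in_background(pipe, worker);  // helper gets its own copy
    }   // the main path drops its reference here
    worker.join();
    assert(FakePipeEnd::live.load() == 0);  // nothing leaked
    return FakePipeEnd::destroyed.load();
}
```

Whether shared ownership is the right fix for daemon_core.cpp depends on who else holds the WritePipeEnd pointer when pipe_close_thread runs, which this thread has not established.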

--
Todd Tannenbaum                       University of Wisconsin-Madison
Condor Project Research               Department of Computer Sciences
tannenba@xxxxxxxxxxx                  1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                 Madison, WI 53706-1685