Re: [Condor-users] Intermittent Condor startd crashes
- Date: Wed, 26 Aug 2009 10:58:58 -0500
- From: Craig Struble <craig.struble@xxxxxxxxxxxxx>
- Subject: Re: [Condor-users] Intermittent Condor startd crashes
Well, I had hoped that <8 slots would fix things, but after running
Condor longer, even 4 slots fails on this one OS X machine (while the
other 22 with 2 slots each run fine, running the same operating system
and condor binaries).
I'm not sure my problem is directly related, being on OS X. In the
StarterLog.slot1 on my machine, the end looks like:
08/22 10:22:19 Job 26912.0 set to execute immediately
08/22 10:22:19 Starting a VANILLA universe job with ID: 26912.0
08/22 10:22:19 IWD: /var/condor/execute/dir_94482
08/22 10:22:19 Output file: /var/condor/execute/dir_94482/
08/22 10:22:20 About to exec /var/condor/execute/dir_94482/
condor_exec.exe cluster_wrapper job_cluster-2.data job- 9 16
08/22 10:22:20 Create_Process succeeded, pid=94490
08/22 11:14:59 Process exited, pid=94490, status=0
08/22 11:14:59 Got SIGQUIT. Performing fast shutdown.
08/22 11:14:59 ShutdownFast all jobs.
08/22 11:14:59 **** condor_starter (condor_STARTER) pid 94482 EXITING WITH STATUS 0
After that, no jobs will run on that slot and running condor_restart
fails to relaunch condor (all daemons except condor_master are killed
but execing new ones fails for some unknown reason).
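To see how often the sequence in the excerpt above (a "Process exited" line followed directly by "Got SIGQUIT") shows up in a slot's StarterLog, something like the following could be used. This is only a rough sketch: find_sigquit_after_exit is a hypothetical helper, not part of Condor, and it assumes the log has already been read into memory one line per string and that the two messages are worded as in the excerpt.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not part of Condor: returns the 0-based indices of
// log lines that contain "Got SIGQUIT" when the line directly before them
// reports a job process exiting.
std::vector<std::size_t> find_sigquit_after_exit(const std::vector<std::string>& lines)
{
    std::vector<std::size_t> hits;
    for (std::size_t i = 1; i < lines.size(); ++i) {
        // Match on the message text only, so the timestamp format does not matter.
        const bool exited  = lines[i - 1].find("Process exited, pid=") != std::string::npos;
        const bool sigquit = lines[i].find("Got SIGQUIT") != std::string::npos;
        if (exited && sigquit) {
            hits.push_back(i);
        }
    }
    return hits;
}
```

Running it over each StarterLog.slotN and comparing the counts between the failing machine and a healthy one would show whether the sequence is specific to the failing slots or also appears where jobs keep running.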
On Aug 21, 2009, at 10:04 AM, Ian Chesal wrote:
Thanks for reporting this, Ian. I was beginning to go crazy trying to
track down a similar issue.
You may not be going crazy. There's some sort of race condition in
how pipes are cleaned up on Windows when you're using hooks. That's for certain.
Interestingly, I've been having problems with Condor 7.3.1 on an 8
slot OS X 10.5 machine. I haven't been able to figure out why this
machine would have slots fail after running a job, while all the
others (2 slot) using the same binaries and configuration ran fine.
Condor also fails to exec daemons when using condor_restart or
condor_off followed by condor_on when using 8 slots.
Reducing the number of slots to 4 makes everything run fine.
I can still make my <8 slot machines fail with hooks. It just takes a
little longer to hit the problem. And my failure is definitely only on
Windows code that cleans up pipes. But using 8 slots or more triggers
the problem very quickly. The error is in daemon_core.cpp:
// If Close_Pipe is called on a Windows WritePipeEnd and there is
// an outstanding overlapped write operation, we can't immediately
// close the pipe. Instead, we call this function in a separate
// thread and close the pipe once the operation is complete
unsigned __stdcall pipe_close_thread(void *arg)
{
    WritePipeEnd* wpe = (WritePipeEnd*)arg;
    dprintf(D_DAEMONCORE, "finally closing pipe %p\n", wpe);
    delete wpe;
    // (remainder of the function not shown in this excerpt)
}
That 'delete wpe' call is the one crashing the startd. Since it's only
called if it's a Windows binary, this might not be related to your
problem.
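For illustration only, here is one way the "hand the object to a separate thread and delete it there" pattern described in the comment above can be written so that exactly one owner ever destroys the object. This is a portable sketch with made-up names (FakePipeEnd, close_when_idle) built on std::thread and std::unique_ptr. It is not Condor's code, it does not use the Windows overlapped I/O API, and it is not a diagnosis of the crash.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <memory>
#include <thread>
#include <utility>

// Hypothetical stand-in for a pipe end with an in-flight asynchronous write.
// This is NOT Condor's WritePipeEnd; the names and behavior are assumptions.
struct FakePipeEnd {
    std::atomic<bool>& write_done;  // set by whoever completes the write
    std::atomic<int>&  destroyed;   // counts how many times the destructor ran

    FakePipeEnd(std::atomic<bool>& w, std::atomic<int>& d)
        : write_done(w), destroyed(d) {}
    ~FakePipeEnd() { ++destroyed; }

    // Poll until the pending write has been marked complete.
    void wait_for_write() const {
        while (!write_done.load()) {
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
    }
};

// Deferred close: ownership moves into the worker thread, so the caller
// no longer holds a pointer it could use or delete after this call.
std::thread close_when_idle(std::unique_ptr<FakePipeEnd> pipe)
{
    return std::thread([p = std::move(pipe)]() {
        p->wait_for_write();
        // When the lambda is destroyed, p releases the object exactly once.
    });
}
```

A caller would create the object with std::make_unique, pass it to close_when_idle, and later join the returned thread. Because the unique_ptr is moved, a second delete from the calling side does not compile, which rules out one class of double-free bugs in this kind of handoff.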
Craig A. Struble, Ph.D. | 369 Cudahy Hall | Marquette University
Associate Professor of Computer Science | (414)288-3783
Director, Master of Bioinformatics Program | (414)288-5472 (fax)
http://www.mscs.mu.edu/~cstruble | craig.struble@xxxxxxxxxxxxx