
Re: [Condor-users] Intermittent Condor startd crashes



Is there anything untoward showing in the OS X Console?

Has the master leaked handles perhaps?
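One rough way to check for a handle leak is to count the master's open descriptors over time with lsof. The sketch below is only an illustration: the function names are made up here, and it assumes the usual lsof column layout (COMMAND PID USER FD ...), where numbered descriptors look like 3u or 12r.

```python
import subprocess


def count_descriptors(lsof_output):
    """Count rows whose FD column is a numbered descriptor (e.g. 3u, 12r).

    Rows such as cwd or txt are skipped; they are not open handles
    in the sense we care about here.
    """
    count = 0
    for line in lsof_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        # Assumption: FD is the fourth whitespace-separated column.
        if len(fields) > 3 and fields[3][0].isdigit():
            count += 1
    return count


def descriptors_for_pid(pid):
    """Run `lsof -p <pid>` and count the numbered descriptors it lists."""
    result = subprocess.run(
        ["lsof", "-p", str(pid)], capture_output=True, text=True
    )
    return count_descriptors(result.stdout)
```

Calling descriptors_for_pid with the condor_master pid every few minutes and watching whether the number keeps climbing would give a first hint. A steadily growing count on the bad machine, but not on the other 22, would point towards a leak.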

-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Nick LeRoy
Sent: 26 August 2009 17:25
To: condor-users@xxxxxxxxxxx
Subject: Re: [Condor-users] Intermittent Condor startd crashes

On Wednesday 26 August 2009, Craig Struble wrote:
Craig,

> Well, I had hoped that <8 slots would fix things, but after running
> Condor longer, even 4 slots fails on this one OS X machine (while the
> other 22 with 2 slots each run fine, running the same operating system
> and condor binaries).
>
> I'm not sure my problem is directly related, being on OS X. In the
> StarterLog.slot1 on my machine, the end looks like:
>
> 08/22 10:22:19 Job 26912.0 set to execute immediately
> 08/22 10:22:19 Starting a VANILLA universe job with ID: 26912.0
> 08/22 10:22:19 IWD: /var/condor/execute/dir_94482
> 08/22 10:22:19 Output file: /var/condor/execute/dir_94482/job_cluster-2.stdout
> 08/22 10:22:20 About to exec /var/condor/execute/dir_94482/condor_exec.exe cluster_wrapper job_cluster-2.data job- 9 16
> 08/22 10:22:20 Create_Process succeeded, pid=94490
> 08/22 11:14:59 Process exited, pid=94490, status=0
> 08/22 11:14:59 Got SIGQUIT.  Performing fast shutdown.
> 08/22 11:14:59 ShutdownFast all jobs.
> 08/22 11:14:59 **** condor_starter (condor_STARTER) pid 94482 EXITING WITH STATUS 0
>
> After that, no jobs will run on that slot and running condor_restart
> fails to relaunch condor (all daemons except condor_master are killed
> but execing new ones fails for some unknown reason).

Just to be clear: has the startd crashed before the starter gets the SIGQUIT?
And, after that, the master can't even exec daemons?  Is that right?  Is
there anything interesting in the MasterLog or StartLog?
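
To line those logs up against the starter's shutdown, a small script can pull out every entry within a minute or so of the SIGQUIT. This is a sketch only: it assumes the "MM/DD HH:MM:SS" prefix seen in the StarterLog excerpt above, takes the year (2009) from the date of this thread, and the function names are hypothetical.

```python
from datetime import datetime, timedelta

STAMP_FORMAT = "%Y/%m/%d %H:%M:%S"


def parse_stamp(line, year=2009):
    """Return the datetime at the start of a log line, or None if absent.

    The logs carry no year, so one is supplied (an assumption).
    """
    try:
        # "MM/DD HH:MM:SS" is the first 14 characters of the line.
        return datetime.strptime(f"{year}/{line[:14]}", STAMP_FORMAT)
    except ValueError:
        return None


def lines_near(lines, when, window_seconds=60):
    """Collect log lines stamped within window_seconds of `when`."""
    window = timedelta(seconds=window_seconds)
    hits = []
    for line in lines:
        stamp = parse_stamp(line)
        if stamp is not None and abs(stamp - when) <= window:
            hits.append(line.rstrip("\n"))
    return hits
```

Opening the MasterLog and StartLog in turn and passing each file object to lines_near with datetime(2009, 8, 22, 11, 14, 59) would show what each daemon logged around the moment the starter got the SIGQUIT. Lines without a timestamp prefix are skipped.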

-Nick

-- 
           <<< The Matrix is everywhere. >>>
 /`-_    Nicholas R. LeRoy               The Condor Project
{     }/ http://www.cs.wisc.edu/~nleroy  http://www.cs.wisc.edu/condor
 \    /  nleroy@xxxxxxxxxxx              The University of Wisconsin
 |_*_|   608-265-5761                    Department of Computer Sciences
