
Re: [Condor-users] Intermittent Condor startd crashes



On Wednesday 26 August 2009, Craig Struble wrote:
Craig,

> Well, I had hoped that <8 slots would fix things, but after running
> Condor longer, even 4 slots fails on this one OS X machine (while the
> other 22 with 2 slots each run fine, running the same operating system
> and condor binaries).
>
> I'm not sure my problem is directly related, being on OS X. In the
> StarterLog.slot1 on my machine, the end looks like:
>
> 08/22 10:22:19 Job 26912.0 set to execute immediately
> 08/22 10:22:19 Starting a VANILLA universe job with ID: 26912.0
> 08/22 10:22:19 IWD: /var/condor/execute/dir_94482
> 08/22 10:22:19 Output file: /var/condor/execute/dir_94482/job_cluster-2.stdout
> 08/22 10:22:20 About to exec /var/condor/execute/dir_94482/condor_exec.exe cluster_wrapper job_cluster-2.data job- 9 16
> 08/22 10:22:20 Create_Process succeeded, pid=94490
> 08/22 11:14:59 Process exited, pid=94490, status=0
> 08/22 11:14:59 Got SIGQUIT.  Performing fast shutdown.
> 08/22 11:14:59 ShutdownFast all jobs.
> 08/22 11:14:59 **** condor_starter (condor_STARTER) pid 94482 EXITING WITH STATUS 0
>
> After that, no jobs will run on that slot and running condor_restart
> fails to relaunch condor (all daemons except condor_master are killed
> but execing new ones fails for some unknown reason).

Just to be clear: had the startd already crashed before the starter got the
SIGQUIT? And after that, the master can't even exec new daemons? Is that
right? Is there anything interesting in the MasterLog or StartLog?
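
If it helps with lining up the logs, here is a rough Python sketch for pulling out the lines that fall near a given moment, such as the 11:14:59 SIGQUIT above. It assumes the other daemon logs use the same "MM/DD HH:MM:SS message" layout as the quoted StarterLog excerpt. The function names and the default year are made up for illustration, and the log lines are passed in rather than read from a path, since where the MasterLog and StartLog live depends on the local configuration.

```python
import re
from datetime import datetime, timedelta

# Timestamp layout taken from the quoted StarterLog excerpt:
# "MM/DD HH:MM:SS message". Other logs are assumed to match it.
STAMP = re.compile(r"^(\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$")


def parse_stamp(text, year=2009):
    """Parse "MM/DD HH:MM:SS"; the logs carry no year, so one is assumed."""
    return datetime.strptime(f"{year}/{text}", "%Y/%m/%d %H:%M:%S")


def lines_near(lines, target, window_seconds=120):
    """Return the log lines stamped within window_seconds of target."""
    center = parse_stamp(target)
    window = timedelta(seconds=window_seconds)
    hits = []
    for line in lines:
        match = STAMP.match(line.strip())
        if not match:
            continue  # skip wrapped or unstamped lines
        if abs(parse_stamp(match.group(1)) - center) <= window:
            hits.append(line.rstrip("\n"))
    return hits


if __name__ == "__main__":
    sample = [
        "08/22 10:22:20 Create_Process succeeded, pid=94490",
        "08/22 11:14:59 Process exited, pid=94490, status=0",
        "08/22 11:14:59 Got SIGQUIT.  Performing fast shutdown.",
    ]
    for hit in lines_near(sample, "08/22 11:14:59"):
        print(hit)
```

Running the same filter over the MasterLog and StartLog contents with the same target time should show what the master and startd were doing when the starter got the signal.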

-Nick

-- 
           <<< The Matrix is everywhere. >>>
 /`-_    Nicholas R. LeRoy               The Condor Project
{     }/ http://www.cs.wisc.edu/~nleroy  http://www.cs.wisc.edu/condor
 \    /  nleroy@xxxxxxxxxxx              The University of Wisconsin
 |_*_|   608-265-5761                    Department of Computer Sciences