
Re: [Condor-users] Why does the Starter daemon keep dying with SEGV?



>Jaime Frey wrote:
>
>>Mark Calleja wrote:
>>
>>We're running a Linux pool with Condor 6.6.11 and we persistently see a
>>number of vanilla jobs whose Starter keeps dying with (from the StartLog):
>>
>>7/17 08:07:03 Starter pid 16900 died on signal 11 (signal 11)
>>7/17 08:07:03 vm1: State change: starter exited
>>
>>The StarterLog shows nothing, even with full debug turned on. The jobs
>>then keep resubmitting themselves to die a similar death. As far as I
>>can tell this is the daemon itself dying, not the application that it's
>>running (which runs fine from the console). We're using the dynamically
>>linked binaries under Debian "etch". Can anyone shed any light on why this
>>should be happening, and more importantly how we can fix it?
>>
>>Ta,
>>Mark
>
>What does the starter log say around the time of the segfault?
>Are there any core files in the condor log directory?

Hi Jaime,

I'm afraid there's nothing in the StarterLog. Here's the relevant snippet for the job mentioned above:

7/17 08:04:25 VM1_USER set, so running job as condor_user
7/17 08:04:30 File transfer completed successfully.
7/17 08:04:31 Starting a VANILLA universe job with ID: 251.23
7/17 08:04:31 IWD: /home/condor/execute/dir_16900
7/17 08:04:31 Output file: /home/condor/execute/dir_16900/out.23
7/17 08:04:31 Error file: /home/condor/execute/dir_16900/err.23
7/17 08:04:31 Renice expr "19" evaluated to 19
7/17 08:04:31 About to exec /home/condor/execute/dir_16900/condor_exec.exe noDark 100000 -23 8 1
7/17 08:04:31 Create_Process succeeded, pid=16902
7/17 08:07:05 ******************************************************
7/17 08:07:05 ** condor_starter (CONDOR_STARTER) STARTING UP
7/17 08:07:05 ** /usr/Condor/RH9/condor-6.6.11-dynamic/sbin/condor_starter
7/17 08:07:05 ** $CondorVersion: 6.6.11 Mar 23 2006 $
7/17 08:07:05 ** $CondorPlatform: I386-LINUX_RH9 $
7/17 08:07:05 ** PID = 22324
7/17 08:07:05 ******************************************************
7/17 08:07:05 Using config file: /home/condor/condor_config
7/17 08:07:05 Using local config files: /home/condor/condor_config.local

Note how the log records nothing about the job's death: a new starter simply starts up at 08:07:05, two seconds after the StartLog reports the signal 11. There's also no core file being left behind anywhere. Sorry!
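
For anyone wanting to cross-check the two logs the same way, here is a minimal sketch in Python. It assumes the StartLog line format quoted above and that the execute directory is named dir_<starter pid>, as it is in this excerpt (pid 16900, dir_16900). The function names find_starter_deaths and lines_for_pid are made up for illustration and are not part of Condor.

```python
import re

# Matches StartLog lines such as:
#   7/17 08:07:03 Starter pid 16900 died on signal 11 (signal 11)
DIED_RE = re.compile(
    r"^(?P<date>\d+/\d+) (?P<time>\d+:\d+:\d+) "
    r"Starter pid (?P<pid>\d+) died on signal (?P<signal>\d+)"
)


def find_starter_deaths(startlog_lines):
    """Return (date, time, pid, signal) for each starter death reported."""
    deaths = []
    for line in startlog_lines:
        m = DIED_RE.match(line.strip())
        if m:
            deaths.append(
                (m["date"], m["time"], int(m["pid"]), int(m["signal"]))
            )
    return deaths


def lines_for_pid(starterlog_lines, pid):
    """Return StarterLog lines mentioning dir_<pid> (assumed naming scheme)."""
    pattern = re.compile(rf"dir_{pid}\b")
    return [line for line in starterlog_lines if pattern.search(line)]


if __name__ == "__main__":
    # Sample lines copied from the logs quoted in this thread.
    startlog = [
        "7/17 08:07:03 Starter pid 16900 died on signal 11 (signal 11)",
        "7/17 08:07:03 vm1: State change: starter exited",
    ]
    starterlog = [
        "7/17 08:04:31 IWD: /home/condor/execute/dir_16900",
        "7/17 08:04:31 Create_Process succeeded, pid=16902",
    ]
    for date, time, pid, sig in find_starter_deaths(startlog):
        print(f"{date} {time}: starter {pid} died on signal {sig}")
        for line in lines_for_pid(starterlog, pid):
            print("   ", line)
```

In practice the two lists would be read from the StartLog and StarterLog files on the execute machine; the sample lines here are only the ones shown in this message.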

Cheers,
Mark

--
Dr Mark Calleja
Cambridge eScience Centre, University of Cambridge
Centre for Mathematical Sciences, Wilberforce Road, Cambridge CB3 0WA
Tel. (+44/0) 1223 765317, Fax  (+44/0) 1223 765900
http://www.escience.cam.ac.uk/~mcal00