Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

Re: [Condor-users] Why does the Starter deamon keeps dying with SEGV?

Date: Tue, 08 Aug 2006 13:13:39 +0100
From: Mark Calleja <M.Calleja@xxxxxxxxxxxxxxx>
Subject: Re: [Condor-users] Why does the Starter deamon keeps dying with SEGV?

>Jaime Frey wrote:
>
>>Mark Calleja wrote:
>>
>>>Jaime Frey wrote:
>>>
>>>>Mark Calleja wrote:
>>>>
>>>>We're running a linux pool with Condor 6.6.11 and we persistently see a
>>>>number of vanilla jobs whose Starter keeps dying with (from the
>>>>StartLog):
>>>>
>>>>7/17 08:07:03 Starter pid 16900 died on signal 11 (signal 11)
>>>>7/17 08:07:03 vm1: State change: starter exited
>>>>
>>>>The StarterLog shows nothing, even with full debug turned on. The
>>>>jobs then keep resubmitting themselves to die a similar death. As far
>>>>as I can tell this is the daemon itself dying, not the application
>>>>that it's running (which runs fine from the console). We're using the
>>>>dynamically linked binaries under Debian "etch". Can anyone shed any
>>>>light on why this should be happening, and more importantly how we can
>>>>fix it?
>>>>
>>>>Ta,
>>>>Mark
>>>
>>>What does the starter log say around the time of the segfault?
>>>Are there any core files in the condor log directory?
>>
>>Hi Jaime,
>>
>>I'm afraid there's nothing in the StarterLog. Here's the relevant
>>snippet for the job mentioned above:
>>
>>7/17 08:04:25 VM1_USER set, so running job as condor_user
>>7/17 08:04:30 File transfer completed successfully.
>>7/17 08:04:31 Starting a VANILLA universe job with ID: 251.23
>>7/17 08:04:31 IWD: /home/condor/execute/dir_16900
>>7/17 08:04:31 Output file: /home/condor/execute/dir_16900/out.23
>>7/17 08:04:31 Error file: /home/condor/execute/dir_16900/err.23
>>7/17 08:04:31 Renice expr "19" evaluated to 19
>>7/17 08:04:31 About to exec /home/condor/execute/dir_16900/condor_exec.exe noDark 100000 -23 8 1
>>7/17 08:04:31 Create_Process succeeded, pid=16902
>>7/17 08:07:05 ******************************************************
>>7/17 08:07:05 ** condor_starter (CONDOR_STARTER) STARTING UP
>>7/17 08:07:05 ** /usr/Condor/RH9/condor-6.6.11-dynamic/sbin/condor_starter
>>7/17 08:07:05 ** $CondorVersion: 6.6.11 Mar 23 2006 $
>>7/17 08:07:05 ** $CondorPlatform: I386-LINUX_RH9 $
>>7/17 08:07:05 ** PID = 22324
>>7/17 08:07:05 ******************************************************
>>7/17 08:07:05 Using config file: /home/condor/condor_config
>>7/17 08:07:05 Using local config files: /home/condor/condor_config.local
>>
>>Note how there's nothing about the job's death, and a new one just
>>starts immediately afterwards. Also, there's no core file being left
>>behind anywhere. Sorry!
>
>Odd. Is this happening regularly or just the one time?

Reasonably regularly. However, a colleague of mine who's been seeing a
similar problem in his pool (different Linux distro and applications)
got the following piece of advice when he recently contacted Condor's
helpdesk: "Wisconsin guys told me that the "Signal 11" bug has been
fixed in 6.8.x release and advise me to upgrade". Does this ring any
bells with you? It seems as if someone within UWCS's corridors of power
is aware of the problem.


Cheers,
Mark
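
One way to put a number on "reasonably regularly" is to count the starter deaths recorded in the StartLog. The following is only a minimal sketch, not part of Condor: it assumes the log lines look like the "died on signal" line quoted earlier in this thread, and the function names (starter_deaths, deaths_per_day) are made up for illustration.

```python
import re
from collections import Counter

# Pattern modelled on the StartLog line quoted in this thread, e.g.
# "7/17 08:07:03 Starter pid 16900 died on signal 11 (signal 11)".
# Other Condor versions may format this line differently (assumption).
DIED = re.compile(
    r"^(\d+/\d+) (\d+:\d+:\d+) Starter pid (\d+) died on signal (\d+)"
)


def starter_deaths(lines):
    """Return a (date, time, pid, signal) tuple for each starter death."""
    deaths = []
    for line in lines:
        # Drop any leading mail quote markers and spaces before matching.
        match = DIED.match(line.lstrip("> ").rstrip())
        if match:
            date, time, pid, sig = match.groups()
            deaths.append((date, time, int(pid), int(sig)))
    return deaths


def deaths_per_day(deaths, signal=11):
    """Count deaths caused by the given signal, grouped by log date."""
    return Counter(d[0] for d in deaths if d[3] == signal)


# The two StartLog lines from the original report.
sample = [
    "7/17 08:07:03 Starter pid 16900 died on signal 11 (signal 11)",
    "7/17 08:07:03 vm1: State change: starter exited",
]
deaths = starter_deaths(sample)
print(deaths)
print(deaths_per_day(deaths))
```

To run it against a real log, pass an open file object to starter_deaths() instead of the sample list. The location of the StartLog depends on the local Condor configuration.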

Follow-Ups:
- Re: [Condor-users] Why does the Starter deamon keeps dying with SEGV?
  - From: Erik Paulson
