
Re: [Condor-users] Jobs die with signal 11



Hi,

It looks like this was my goof: I was running an executable built on a machine with glibc v2.3.x on a box with v2.2.x. Having now rebuilt the executable on a matching platform, I no longer get those pesky signal 11s. However, other problems have reared their heads, which I'll mention in a separate thread.
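For reference, one quick way to check for this kind of mismatch is to compare the glibc symbol versions the binary references with the glibc installed on the execute host. A minimal sketch, with "my_job" standing in for the real executable name:

# On the execute host: which glibc is installed?
ldd --version | head -n 1

# Which glibc symbol versions does the binary reference? ("my_job" is a placeholder)
objdump -T my_job | grep GLIBC_

If objdump shows GLIBC_2.3.x version tags but the host only provides glibc 2.2.x, the binary needs to be rebuilt on (or built against) the older platform's glibc.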

Thx for the help,
Mark

Nick LeRoy wrote:

On Tue September 28 2004 12:25 pm, mcal00@xxxxxxxxxxxxx wrote:


Hi, we've got an old Linux cluster (i286 processors running RH7.2) that
we've converted into a Condor pool, and we constantly see jobs dying with
Shadow exceptions. The only clue in the StarterLog files is of the
form (Condor v6.6.6 all round):

9/27 14:12:33 vm2: Got activate_claim request from shadow
(<172.24.116.193:42835>)
9/27 14:12:33 vm2: Remote job ID is 352.0
9/27 14:12:33 vm2: Got universe "VANILLA" (5) from request classad
9/27 14:12:33 vm2: State change: claim-activation protocol successful
9/27 14:12:33 vm2: Changing activity: Idle -> Busy
9/27 14:21:22 Starter pid 21927 died on signal 11 (signal 11)
9/27 14:21:22 vm2: State change: starter exited
9/27 14:21:22 vm2: Changing activity: Busy -> Idle

What does that signal 11 mean? I notice that someone spotted something similar
under Solaris last year (message 476), and Erik Paulson suggested that it
may have been a bug. Was it ever resolved?



Signal 11 means "segmentation fault" (SEGV); i.e., the program crashed. Most likely, this is due to a buggy application being started by Condor.




These jobs are coming in from flocked pools across the campus, so the
network they have to traverse is slightly unfriendlier than your average
LAN. Could such a signal be due to a network glitch?



I really can't see how network topology could cause a job to get a SEGV, except in some unusual circumstances.


One thing to try would be to run the job's executable directly on a machine that you've seen it crash on (that is, with Condor not involved at all). If it runs like that without crashing, then there's some unintended interaction; but, most likely, you'll see the same behavior. In the above log, the job had been running for roughly nine minutes before the starter died, so you shouldn't have to wait very long.
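For example, something along these lines on one of the machines where it has crashed, with "my_job" standing in for the real executable:

# Allow a core dump to be written, then run the job by hand
ulimit -c unlimited
./my_job arg1 arg2    # same arguments the Condor job was given (placeholders)

# 139 = 128 + 11, i.e. the process was killed by SIGSEGV
echo $?

# If a core file appeared, look at the backtrace
gdb ./my_job core

If you get the same segfault outside of Condor, the problem is in the application (or how it was built) rather than in the pool.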

There is one other thing that I can think of looking into: environment variables. User jobs may get started with a very different set of environment variables than those they were submitted with, which could also cause a buggy application to crash.
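If the environment does turn out to matter, the submit description file can carry the submitter's environment along with the job. A minimal sketch, with "my_job" as a placeholder executable name:

# Vanilla-universe submit file sketch; "my_job" is a placeholder.
# getenv = True asks Condor to copy the submitting shell's environment
# into the job's environment on the execute machine.
universe   = vanilla
executable = my_job
getenv     = True
queue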

Hope this helps

-Nick





--
Dr Mark Calleja
Cambridge eScience Centre, University of Cambridge
Centre for Mathematical Sciences, Wilberforce Road, Cambridge CB3 0WA
Tel. (+44/0) 1223 765317, Fax  (+44/0) 1223 765900
http://www.esc.cam.ac.uk/~mcal00