Re: [Condor-users] job running on two hosts?
- Date: Wed, 17 Nov 2004 09:45:48 -0600 (CST)
- From: Chris Green <greenc@xxxxxxxx>
- Subject: Re: [Condor-users] job running on two hosts?
"/usr/sbin/condor_schedd" on "jdc.math.uwo.ca" died due to signal 11.
Condor will automatically restart this process in 10 seconds.
But now the question is, why did it die?
Of course, my reply below is a non sequitur, since I misread the original
email and thought the SEGV applied to your job, not the schedd. My bad:
coffee no function Chris well without.
SEGVs have several possible causes, but most boil down to (1) bad code;
(2) inconsistent code; or (3) bad hardware. If your job doesn't use random
numbers (or uses pseudo-random numbers with the same seed) and completes on
one node but fails on another, you are looking at explanation (2) or (3).
Our site is restricted (for reasons I won't go into) to running vanilla jobs
only, so I can't comment on what happens in the standard universe if shared
libraries upon which your program depends are different or otherwise
inconsistent from one node to another. Explanation (3), though, tends to be
memory or occasionally disk, and only *very* rarely CPU; if 'twere the
latter, you'd be noticing far more than just the occasional job failure.
If you can't tie a node-specific failure down to shared libraries or
similar, run some hardware tests.
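As a quick way to check for inconsistent shared libraries between nodes, a
sketch like the following might help: it fingerprints every shared library a
binary depends on, so you can diff the output from two nodes. This is just an
illustration, not part of Condor; /bin/ls stands in for your job's binary.

```shell
#!/bin/sh
# Fingerprint the shared libraries a binary depends on. Run this on each
# node and diff the results; any mismatched checksum is a library that
# differs between the nodes. Substitute your job's binary for BIN.
BIN=/bin/ls
ldd "$BIN" | awk '$3 ~ /^\// {print $3}' | sort | xargs md5sum
```

If the checksums agree everywhere but the failure is still node-specific,
that points back at hardware (run a memory tester on the suspect node).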
I hope this helps you find your problem.
Condor-users mailing list
Chris Green, MiniBooNE / LANL. Email greenc@xxxxxxxx
Tel: (630) 840-2167. Fax: (630) 840-3867