
Re: [Condor-users] job running on two hosts?

Date: Tue, 16 Nov 2004 02:11:34 -0500

> "/usr/sbin/condor_schedd" on "jdc.math.uwo.ca" died due to signal 11.
> Condor will automatically restart this process in 10 seconds.
>
> But now the question is, why did it die?


SEGVs have several possible causes, but most boil down to (1) bad code, (2) inconsistent code between nodes, or (3) bad hardware. If your job doesn't use random numbers (or uses pseudo-random numbers with a fixed seed) and it completes on one node but fails on another, you are looking at explanation (2) or (3).

Our site is restricted, for reasons I won't go into, to running vanilla-universe jobs only, so I can't comment on what happens in the standard universe if the shared libraries your program depends on differ or are otherwise inconsistent from one node to another. Cause (3) tends to be memory, occasionally disk, and only *very* rarely CPU; if 'twere the latter, though, you'd be noticing far more than the occasional failed job. If you can't tie a node-specific failure down to shared libraries or similar, run some hardware tests.
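One quick way to chase down explanation (2) is to record which shared libraries your binary resolves on each node and diff the lists. A minimal sketch, assuming a Linux node with `ldd` and `md5sum` available; `/bin/ls` is only a stand-in for your job's executable, and the output path is an arbitrary choice:

```shell
#!/bin/sh
# Record the shared libraries a binary resolves on this node, so the
# per-node lists can be diffed later.  Pass your job's executable as
# the first argument; /bin/ls is just a placeholder default.
BIN="${1:-/bin/ls}"
OUT="/tmp/libs.$(hostname -s).txt"

# Keep only lines of the form "libfoo.so => /abs/path (0x...)" and
# save the resolved absolute paths, sorted for a stable diff.
ldd "$BIN" | awk '/=>/ && $3 ~ /^\// {print $3}' | sort > "$OUT"

# Checksums catch the same path holding different contents on
# different nodes -- any differing line is a suspect.
md5sum $(cat "$OUT")
```

Run it on each node, collect the `/tmp/libs.<host>.txt` files and checksum output somewhere shared, and `diff` them pairwise; a library that resolves to a different path or a different checksum on the failing node is a good place to start.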

I hope this helps you find your problem.


Dan

_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
http://lists.cs.wisc.edu/mailman/listinfo/condor-users

--
Chris Green, MiniBooNE / LANL. Email greenc@xxxxxxxx
Tel: (630) 840-2167. Fax: (630) 840-3867