Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] job running on two hosts?

Date: Wed, 17 Nov 2004 09:45:48 -0600 (CST)
From: Chris Green <greenc@xxxxxxxx>
Subject: Re: [Condor-users] job running on two hosts?

"/usr/sbin/condor_schedd" on "jdc.math.uwo.ca" died due to signal 11.
Condor will automatically restart this process in 10 seconds.

But now the question is, why did it die?

Of course, my reply below is a non sequitur, since I misread the original email and thought the SEGV applied to your job, not the schedd. My bad: coffee no function Chris well without.

Chris.

Hi,

SEGVs have several possible causes, but most boil down to (1) bad code; (2) inconsistent code; or (3) bad hardware. If your job doesn't use random numbers (or uses pseudo-random numbers with the same seed) and it completes on one node but fails on another, you are looking at explanation (2) or (3). Our site is restricted, for reasons I won't go into, to running vanilla jobs only, so I can't comment on what happens in the standard universe if the shared libraries your program depends on differ or are otherwise inconsistent from one node to another. (3), though, tends to be memory or occasionally disk, and only *very* rarely CPU. If 'twere the latter, though, you'd be noticing far more than just the occasional job failure. If you can't tie a node-specific failure down to shared libraries or similar, run some hardware tests.
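
One way to check explanation (2) is to fingerprint the shared libraries a binary resolves to on each node and compare the results. The following is only a rough sketch, assuming Linux nodes with `ldd` and Python 3.7+ available; the function names (`parse_ldd`, `fingerprint`, `differing`) are made up for illustration and are not part of Condor.

```python
import hashlib
import subprocess
import sys


def parse_ldd(text):
    """Map library name -> resolved path from `ldd` output lines such as
    'libm.so.6 => /lib/libm.so.6 (0x...)'. Lines without '=>' are skipped."""
    libs = {}
    for line in text.splitlines():
        if "=>" not in line:
            continue
        name, _, rest = line.partition("=>")
        fields = rest.split()
        # "not found" or an empty right-hand side means no resolved path.
        if fields and fields[0].startswith("/"):
            libs[name.strip()] = fields[0]
    return libs


def fingerprint(binary):
    """Return {library name: SHA-256 of the resolved file} for one binary."""
    out = subprocess.run(
        ["ldd", binary], capture_output=True, text=True, check=True
    ).stdout
    result = {}
    for name, path in parse_ldd(out).items():
        with open(path, "rb") as fh:
            result[name] = hashlib.sha256(fh.read()).hexdigest()
    return result


def differing(a, b):
    """Names whose fingerprint differs, or which appear on only one node."""
    return sorted(n for n in set(a) | set(b) if a.get(n) != b.get(n))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Run on each node, then diff the two outputs.
    for name, digest in sorted(fingerprint(sys.argv[1]).items()):
        print(digest, name)
```

Run it against the job's executable on the node where the job completes and on the node where it fails, then diff the two listings. Any library that shows up with a different digest, or on only one side, is a candidate for the inconsistency.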

I hope this helps you find your problem.

Chris.
Dan
_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
http://lists.cs.wisc.edu/mailman/listinfo/condor-users


--
Chris Green, MiniBooNE / LANL. Email greenc@xxxxxxxx
Tel: (630) 840-2167. Fax: (630) 840-3867
