
[Condor-users] AMD Opteron Crashes



Hi,
In https://lists.cs.wisc.edu/archive/condor-users/pre-2004-June/msg01368.shtml I read that it should be possible to run Condor on a 64-bit Opteron system using the linux-x86-glibc23-dynamic binaries.


Everything works fine until Condor tries to start a job; at that point condor_starter crashes with a segmentation fault (signal 11).

I tried this with condor-6.6.8-linux-x86-glibc22-dynamic.tar.gz, condor-6.6.8-linux-x86-glibc23-dynamic.tar.gz, and condor-6.7.5-linux-x86-glibc23-dynamic.tar.gz. The behaviour is the same in every case. We're running SUSE Linux Enterprise Server 9, and user information is stored in LDAP. Excerpts from the log files are included below; I can provide more details if they would help.

Any thoughts on this? Is anyone successfully running Condor on a similar Opteron system?

	Steffen


--- System info
acorn:/ # cat /etc/SuSE-release
SUSE LINUX Enterprise Server 9 (x86_64)
VERSION = 9
acorn:/ # uname -a
Linux acorn 2.6.5-7.139-smp #1 SMP Fri Jan 14 15:41:33 UTC 2005 x86_64 x86_64 x86_64 GNU/Linux


--- From StartLog:
StartLog:3/4 15:48:32 Starter pid 18488 died on signal 11 (signal 11)

--- From /var/log/messages
Mar 4 15:48:32 acorn kernel: condor_starter[18488]: segfault at 00000000a4e0efc5 rip 00000000559a4dac rsp 00000000ffffc4a8 error 4


--- From StarterLog.vm2
3/4 15:48:29 (fd:9) PASSWD_CACHE_REFRESH is undefined, using default value of 300
3/4 15:48:29 (fd:9) Finding local host information, calling gethostname()
[...]
3/4 15:48:29 (fd:9) passwd_cache::cache_uid(): getpwnam("condor") failed: user not found
3/4 15:48:29 (fd:9) passwd_cache::cache_uid(): getpwnam("condor") failed: user not found
3/4 15:48:29 (fd:9) PRIV_UNKNOWN --> PRIV_CONDOR at daemon_core_main.C:1382
3/4 15:48:29 (fd:9) KEYCACHE: created: 82ca8d8
3/4 15:48:29 (fd:9) ******************************************************
3/4 15:48:29 (fd:9) ** condor_starter (CONDOR_STARTER) STARTING UP
3/4 15:48:30 (fd:9) ** /vis/data/people/condor/linux-glibc23/sbin/condor_starter
3/4 15:48:30 (fd:9) ** $CondorVersion: 6.6.8 Jan 27 2005 $
3/4 15:48:30 (fd:9) ** $CondorPlatform: I386-LINUX_RH9 $
3/4 15:48:30 (fd:9) ** PID = 18488
3/4 15:48:30 (fd:9) ** Running as root: Privilege switching in effect
3/4 15:48:30 (fd:9) ******************************************************
[...]
TransferSocket = "<130.73.68.82:21118>"
ShadowVersion = "$CondorVersion: 6.6.8 Jan 27 2005 $"
UidDomain = "zib.de"
3/4 15:48:32 (fd:11) --- End of ClassAd ---
3/4 15:48:32 (fd:11) STARTER_TIMEOUT_MULTIPLIER is undefined, using default value of 0
3/4 15:48:32 (fd:11) New Daemon obj (shadow) name: "onyx3.zib.de", pool: "NULL", addr: "NULL"
3/4 15:48:32 (fd:11) Version of Shadow is $CondorVersion: 6.6.8 Jan 27 2005 $
3/4 15:48:32 (fd:11) Starter communicating with condor_shadow <130.73.68.82:21118>
3/4 15:48:32 (fd:11) Submitting machine is "onyx3.zib.de"
3/4 15:48:32 (fd:11) Doing CONDOR_register_starter_info
3/4 15:48:32 (fd:11) ShouldTransferFiles is "NO", NOT transfering files
3/4 15:48:32 (fd:11) Submit UidDomain: "zib.de"
3/4 15:48:32 (fd:11) Local UidDomain: "zib.de"
3/4 15:48:32 (fd:11) Initialized user_priv as "..."
[ the daemon crashes at this point ]
--- End of log


--
Steffen Prohaska <prohaska@xxxxxx>  <http://www.zib.de/prohaska/>
Zuse Institute Berlin, Takustraße 7, D-14195 Berlin-Dahlem, Germany
+49 (30) 841 85-337, fax -107
1024D/DA749299 print 8B59 83A8 A43D E0E2 DEDB   D479 3157 2FEA DA74 9299
