
Re: [Condor-users] AMD Opteron Crashes



Andrey Kaliazin wrote:

Steffen,

I suspect this has something to do with LDAP authentication rather than
with AMD64. We are trying to install Condor 6.6.8 on a Linux cluster running
RH9 on dual-Xeon nodes, and we get similar crashes (SCHEDD ... died on
signal 11) when it fails to identify the user, whether or not we use NFS.

Is anyone else successfully running Condor on systems where UIDs/GIDs are
provided via LDAP rather than by the passwd file?

Andrey



-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Steffen Prohaska
Sent: Friday, March 04, 2005 3:16 PM
To: Condor-Users Mail List
Subject: [Condor-users] AMD Opteron Crashes


Hi,
In https://lists.cs.wisc.edu/archive/condor-users/pre-2004-June/msg01368.shtml I read that it should be possible to use the linux-x86-glibc23-dynamic binary to run Condor on a 64-bit Opteron system.


Everything's working fine until condor tries to start a job. The condor_starter crashes with a SEGFAULT.

I tried this with condor-6.6.8-linux-x86-glibc22-dynamic.tar.gz, condor-6.6.8-linux-x86-glibc23-dynamic.tar.gz, and condor-6.7.5-linux-x86-glibc23-dynamic.tar.gz. The behaviour is similar in every case. We're running SUSE Linux Enterprise Server 9 (x86_64). User information is stored in LDAP. I have attached excerpts from the log files below. If more details would help, I can provide them.

Any thoughts on this? Is anyone successfully running Condor on a similar Opteron system?

	Steffen


--- System info
acorn:/ # cat /etc/SuSE-release
SUSE LINUX Enterprise Server 9 (x86_64)
VERSION = 9
acorn:/ # uname -a
Linux acorn 2.6.5-7.139-smp #1 SMP Fri Jan 14 15:41:33 UTC 2005 x86_64 x86_64 x86_64 GNU/Linux


--- From StartLog:
StartLog:3/4 15:48:32 Starter pid 18488 died on signal 11 (signal 11)

--- From /var/log/messages
Mar 4 15:48:32 acorn kernel: condor_starter[18488]: segfault at 00000000a4e0efc5 rip 00000000559a4dac rsp 00000000ffffc4a8 error 4
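
For readers unfamiliar with the kernel's segfault line, the following is a minimal sketch of how it might be picked apart. It assumes the error field is printed in hexadecimal and uses the x86 page-fault error code bit layout; the function names are made up for illustration and are not part of Condor or the kernel.

```python
import re

# Regex for the x86_64 kernel segfault message quoted above.
# Assumption: all four numeric fields are hexadecimal.
SEGFAULT_RE = re.compile(
    r"segfault at ([0-9a-f]+) rip ([0-9a-f]+) rsp ([0-9a-f]+) error ([0-9a-f]+)"
)

# x86 page-fault error code bits: (mask, meaning if set, meaning if clear).
PF_BITS = [
    (0x1, "protection violation", "page not present"),
    (0x2, "write access", "read access"),
    (0x4, "user mode", "kernel mode"),
    (0x8, "reserved bit set in page table", None),
    (0x10, "instruction fetch", None),
]


def decode_page_fault(code):
    """Return a list of human-readable flags for a page-fault error code."""
    flags = []
    for mask, if_set, if_clear in PF_BITS:
        if code & mask:
            flags.append(if_set)
        elif if_clear is not None:
            flags.append(if_clear)
    return flags


def parse_segfault_line(line):
    """Extract addresses and decoded error flags, or None if no match."""
    match = SEGFAULT_RE.search(line)
    if match is None:
        return None
    addr, rip, rsp, err = (int(field, 16) for field in match.groups())
    return {
        "fault_address": addr,
        "rip": rip,
        "rsp": rsp,
        "flags": decode_page_fault(err),
        # True when every address is below 4 GiB, as one would expect
        # for a 32-bit process running on a 64-bit kernel.
        "fits_in_32_bits": all(value < 2**32 for value in (addr, rip, rsp)),
    }


if __name__ == "__main__":
    sample = ("Mar 4 15:48:32 acorn kernel: condor_starter[18488]: "
              "segfault at 00000000a4e0efc5 rip 00000000559a4dac "
              "rsp 00000000ffffc4a8 error 4")
    print(parse_segfault_line(sample))
```

Under these assumptions, error 4 reads as a user-mode read of an unmapped page, which says where the starter faulted but not why.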


--- From StarterLog.vm2
3/4 15:48:29 (fd:9) PASSWD_CACHE_REFRESH is undefined, using default value of 300
3/4 15:48:29 (fd:9) Finding local host information, calling gethostname()
[...]
3/4 15:48:29 (fd:9) passwd_cache::cache_uid(): getpwnam("condor") failed: user not found
3/4 15:48:29 (fd:9) passwd_cache::cache_uid(): getpwnam("condor") failed: user not found
3/4 15:48:29 (fd:9) PRIV_UNKNOWN --> PRIV_CONDOR at daemon_core_main.C:1382
3/4 15:48:29 (fd:9) KEYCACHE: created: 82ca8d8
3/4 15:48:29 (fd:9) ******************************************************
3/4 15:48:29 (fd:9) ** condor_starter (CONDOR_STARTER) STARTING UP
3/4 15:48:30 (fd:9) ** /vis/data/people/condor/linux-glibc23/sbin/condor_starter
3/4 15:48:30 (fd:9) ** $CondorVersion: 6.6.8 Jan 27 2005 $
3/4 15:48:30 (fd:9) ** $CondorPlatform: I386-LINUX_RH9 $
3/4 15:48:30 (fd:9) ** PID = 18488
3/4 15:48:30 (fd:9) ** Running as root: Privilege switching in effect
3/4 15:48:30 (fd:9) ******************************************************
[...]
TransferSocket = "<130.73.68.82:21118>"
ShadowVersion = "$CondorVersion: 6.6.8 Jan 27 2005 $"
UidDomain = "zib.de"
3/4 15:48:32 (fd:11) --- End of ClassAd ---
3/4 15:48:32 (fd:11) STARTER_TIMEOUT_MULTIPLIER is undefined, using default value of 0
3/4 15:48:32 (fd:11) New Daemon obj (shadow) name: "onyx3.zib.de", pool: "NULL", addr: "NULL"
3/4 15:48:32 (fd:11) Version of Shadow is $CondorVersion: 6.6.8 Jan 27 2005 $
3/4 15:48:32 (fd:11) Starter communicating with condor_shadow <130.73.68.82:21118>
3/4 15:48:32 (fd:11) Submitting machine is "onyx3.zib.de"
3/4 15:48:32 (fd:11) Doing CONDOR_register_starter_info
3/4 15:48:32 (fd:11) ShouldTransferFiles is "NO", NOT transfering files
3/4 15:48:32 (fd:11) Submit UidDomain: "zib.de"
3/4 15:48:32 (fd:11) Local UidDomain: "zib.de"
3/4 15:48:32 (fd:11) Initialized user_priv as "..."
[ at this time the daemon crashes ]
--- End of log
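
Since the StarterLog shows getpwnam("condor") failing, one quick check is whether the account resolves through the system's name service at all. The sketch below uses Python's standard pwd module, which calls the C library's getpwnam(); the helper name lookup_user is hypothetical. One caveat, stated as an assumption: the interpreter is probably a 64-bit process on this machine, while the Condor binaries are 32-bit, so a successful lookup here would not prove that a 32-bit process can resolve the same user.

```python
import pwd


def lookup_user(name):
    """Resolve a user through the system name service (passwd, LDAP, ...).

    Returns a small dict on success, or None if the user is not found.
    """
    try:
        entry = pwd.getpwnam(name)
    except KeyError:
        # pwd.getpwnam raises KeyError when no such user exists.
        return None
    return {
        "name": entry.pw_name,
        "uid": entry.pw_uid,
        "gid": entry.pw_gid,
        "home": entry.pw_dir,
    }


if __name__ == "__main__":
    # "condor" is the account the starter failed to find in the log above.
    for account in ("condor", "root"):
        print(account, lookup_user(account))
```

If "condor" comes back as None, the account is simply not defined on the machine; if it resolves here but the starter still reports "user not found", the difference between the two processes is worth a closer look.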


--
Steffen Prohaska <prohaska@xxxxxx> <http://www.zib.de/prohaska/>
Zuse Institute Berlin, Takustraße 7, D-14195 Berlin-Dahlem, Germany
+49 (30) 841 85-337, fax -107
1024D/DA749299 print 8B59 83A8 A43D E0E2 DEDB D479 3157 2FEA DA74 9299


Hi,

I have Condor running on our desktops, which authenticate through OpenLDAP. The version is 6.7.5, but I was previously running 6.6.7 without the issue you describe. Note that all our desktops are x86. There is a dual-Opteron desktop, but it is not in the Condor pool. I could test on that box if you are interested.

Prakash