
RE: [Condor-users] AMD Opteron Crashes



Hi Andrey

Please check the /etc/hosts file, in case the entries there are not defined in the following form:

127.0.0.1 localhost.localdomain     localhost
<IP>     machine.domain.com    machine
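
To make the check above repeatable across nodes, here is a minimal Python sketch. It only tests the layout shown above: a 127.0.0.1 line that names localhost, and a non-loopback line for the machine with the fully qualified name listed first. The function names parse_hosts and check_hosts are made up for this illustration, and nothing here is part of Condor.

```python
def parse_hosts(text):
    """Return a list of (ip, [names]) from hosts-file text, skipping comments."""
    entries = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        fields = line.split()
        if len(fields) >= 2:
            entries.append((fields[0], fields[1:]))
    return entries


def check_hosts(text, short_name):
    """Return a list of problems; an empty list means the layout matches."""
    problems = []
    entries = parse_hosts(text)

    # 127.0.0.1 should carry the localhost names.
    loopback = [names for ip, names in entries if ip == "127.0.0.1"]
    if not any("localhost" in names for names in loopback):
        problems.append("no 127.0.0.1 entry naming localhost")

    # The machine should have its own line: <IP> machine.domain.com machine
    own = [names for ip, names in entries
           if ip != "127.0.0.1" and short_name in names]
    if not own:
        problems.append("no non-loopback entry for " + short_name)
    elif not any("." in names[0] for names in own):
        problems.append(
            "fully qualified name is not listed first for " + short_name)
    return problems
```

On a node this could be called as check_hosts(open("/etc/hosts").read(), "machine"), with "machine" replaced by the node's short host name.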

On Fri, 2005-03-04 at 21:09, Andrey Kaliazin wrote:
Steffen,

I suspect that it could have something to do with LDAP authentication rather
than with AMD64,
because we are trying to install Condor 6.6.8 on a Linux cluster running
RH9 on dual-Xeon nodes
and are getting similar crashes (SCHEDD ... died on signal 11) when it fails to
identify the user, whether
or not we use NFS.

Does anyone else successfully run Condor on systems where UIDs/GIDs are
provided via LDAP
rather than by the passwd file?

Andrey
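
A quick way to see whether an account resolves outside of Condor is to ask the system's user database directly. The sketch below uses Python's standard pwd module, which goes through the same getpwnam() call that appears in the StarterLog excerpt quoted below. The helper name lookup_user is made up for this illustration. Note the assumption: the result reflects what the Python interpreter's own libraries can resolve, and says nothing definite about how the Condor binaries resolve users.

```python
import pwd


def lookup_user(name):
    """Return (uid, gid) for name, or None if the account cannot be resolved."""
    try:
        entry = pwd.getpwnam(name)  # raises KeyError when the user is unknown
    except KeyError:
        return None
    return entry.pw_uid, entry.pw_gid
```

For example, lookup_user("condor") returning None would match the "user not found" message in the log.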

> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx 
> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of 
> Steffen Prohaska
> Sent: Friday, March 04, 2005 3:16 PM
> To: Condor-Users Mail List
> Subject: [Condor-users] AMD Opteron Crashes
> 
> Hi,
> In
> https://lists.cs.wisc.edu/archive/condor-users/pre-2004-June/msg01368.shtml
> I read that it should be possible to use the
> linux-x86-glibc23-dynamic binary on a 64-bit Opteron system to run
> Condor.
> 
> Everything's working fine until condor tries to start a job. The  
> condor_starter crashes with a SEGFAULT.
> 
> I tried this with condor-6.6.8-linux-x86-glibc22-dynamic.tar.gz,
> condor-6.6.8-linux-x86-glibc23-dynamic.tar.gz, and
> condor-6.7.5-linux-x86-glibc23-dynamic.tar.gz. The behaviour is always
> similar. We're running SUSE Linux Enterprise Server. User information is
> stored in LDAP. I attached excerpts from log files below. If more
> details would help, I can provide them.
> 
> Any thoughts on this? Is anyone successfully running Condor on a  
> similar Opteron system?
> 
> 	Steffen
> 
> 
> --- System info
> acorn:/ # cat /etc/SuSE-release
> SUSE LINUX Enterprise Server 9 (x86_64)
> VERSION = 9
> acorn:/ # uname -a
> Linux acorn 2.6.5-7.139-smp #1 SMP Fri Jan 14 15:41:33 UTC 2005 x86_64 x86_64 x86_64 GNU/Linux
> 
> --- From StartLog:
> StartLog:3/4 15:48:32 Starter pid 18488 died on signal 11 (signal 11)
> 
> --- From /var/log/messages
> Mar  4 15:48:32 acorn kernel: condor_starter[18488]: segfault at 00000000a4e0efc5 rip 00000000559a4dac rsp 00000000ffffc4a8 error 4
> 
> --- From StarterLog.vm2
> 3/4 15:48:29 (fd:9) PASSWD_CACHE_REFRESH is undefined, using default value of 300
> 3/4 15:48:29 (fd:9) Finding local host information, calling gethostname()
> [...]
> 3/4 15:48:29 (fd:9) passwd_cache::cache_uid(): getpwnam("condor") failed: user not found
> 3/4 15:48:29 (fd:9) passwd_cache::cache_uid(): getpwnam("condor") failed: user not found
> 3/4 15:48:29 (fd:9) PRIV_UNKNOWN --> PRIV_CONDOR at daemon_core_main.C:1382
> 3/4 15:48:29 (fd:9) KEYCACHE: created: 82ca8d8
> 3/4 15:48:29 (fd:9) ******************************************************
> 3/4 15:48:29 (fd:9) ** condor_starter (CONDOR_STARTER) STARTING UP
> 3/4 15:48:30 (fd:9) ** /vis/data/people/condor/linux-glibc23/sbin/condor_starter
> 3/4 15:48:30 (fd:9) ** $CondorVersion: 6.6.8 Jan 27 2005 $
> 3/4 15:48:30 (fd:9) ** $CondorPlatform: I386-LINUX_RH9 $
> 3/4 15:48:30 (fd:9) ** PID = 18488
> 3/4 15:48:30 (fd:9) ** Running as root: Privilege switching in effect
> 3/4 15:48:30 (fd:9) ******************************************************
> [...]
> TransferSocket = "<130.73.68.82:21118>"
> ShadowVersion = "$CondorVersion: 6.6.8 Jan 27 2005 $"
> UidDomain = "zib.de"
> 3/4 15:48:32 (fd:11) --- End of ClassAd ---
> 3/4 15:48:32 (fd:11) STARTER_TIMEOUT_MULTIPLIER is undefined, using default value of 0
> 3/4 15:48:32 (fd:11) New Daemon obj (shadow) name: "onyx3.zib.de", pool: "NULL", addr: "NULL"
> 3/4 15:48:32 (fd:11) Version of Shadow is $CondorVersion: 6.6.8 Jan 27 2005 $
> 3/4 15:48:32 (fd:11) Starter communicating with condor_shadow <130.73.68.82:21118>
> 3/4 15:48:32 (fd:11) Submitting machine is "onyx3.zib.de"
> 3/4 15:48:32 (fd:11) Doing CONDOR_register_starter_info
> 3/4 15:48:32 (fd:11) ShouldTransferFiles is "NO", NOT transfering files
> 3/4 15:48:32 (fd:11) Submit UidDomain: "zib.de"
> 3/4 15:48:32 (fd:11)  Local UidDomain: "zib.de"
> 3/4 15:48:32 (fd:11) Initialized user_priv as "..."
> [ at this time the daemon crashes ]
> --- End of log
> 
> --
> Steffen Prohaska <prohaska@xxxxxx>  <http://www.zib.de/prohaska/>
> Zuse Institute Berlin, Takustraße 7, D-14195 Berlin-Dahlem, Germany
> +49 (30) 841 85-337, fax -107
> 1024D/DA749299 print 8B59 83A8 A43D E0E2 DEDB   D479 3157 2FEA DA74 9299
> 


Thanks and Regards
P r a s h a n t  L a l

Cadence Design Systems

Noida Export Processing Zone,
Noida - 201301,
Phone:+91 120 2562842, extn 4009
Fax:+91 120 2562231
Cell:+91 98101-44168

mailto: lalp@ cadence.com