
SCHEDD's crashes. Was: [Condor-users] AMD Opteron Crashes



Hi all,

Our problem is a strange one; at least I cannot work out the cause. What
happens is this: a user, say "user5" in group "research", submits a job to
the pool. The Negotiator matches the job, the job is submitted, and the
relevant line is written to the job.log file. Then a failure occurs with the
message that Condor cannot write to the output file job.out as user "user5".
The job remains idle, and on any attempt to remove it from the queue the
SCHEDD dies suddenly and silently, in spite of the D_FULLDEBUG setting.
It won't start again until the $(CONDOR)/spool/job_queue.log file is
deleted completely.
A closer look revealed that the job.out file is created with owner "user5"
(which is correct) and group "nfsnobody", which is apparently not correct.
No wonder Condor cannot write to it. This seems to happen regardless of
whether Condor runs as root or as condor, and whether USE_NFS is False or
True. Our cluster sysadmin is very reluctant to let Condor run as root, by
the way.

We have even added the user condor to the LDAP database instead of the
passwd file, made it a member of the same "research" group, and made the
user's home directory writable by the group, all in vain. It looks like the
Condor daemons do not switch uid/gid properly for some reason.
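
One way to check the uid/gid side of this outside of Condor is a small
script run on the affected node. The following is only a minimal sketch,
assuming Python is available there; the helper names lookup_user and
describe_owner are made up for illustration, and "user5", "condor" and
"job.out" are simply the names used in this report. It asks the system's
name service (passwd file or LDAP, whichever is configured) to resolve a
user name, and reports the owner and group recorded on a file.

```python
import grp
import os
import pwd


def lookup_user(name):
    """Return (uid, primary gid) for a user name, or None if it cannot be resolved."""
    try:
        entry = pwd.getpwnam(name)
    except KeyError:
        # Same situation as the getpwnam() "user not found" lines in the quoted log.
        return None
    return entry.pw_uid, entry.pw_gid


def describe_owner(path):
    """Return the numeric owner/group of a file and their names, if resolvable."""
    st = os.stat(path)
    try:
        user = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        user = None
    try:
        group = grp.getgrgid(st.st_gid).gr_name
    except KeyError:
        group = None
    return {"uid": st.st_uid, "user": user, "gid": st.st_gid, "group": group}


if __name__ == "__main__":
    # Names taken from this report; adjust them for the local setup.
    for name in ("user5", "condor"):
        print(name, "->", lookup_user(name))
    if os.path.exists("job.out"):
        print("job.out ->", describe_owner("job.out"))
```

If lookup_user returns None for "condor" or "user5", or job.out shows a
group such as "nfsnobody", that would point at name resolution or NFS id
mapping on the node rather than at Condor itself, though this is only a
hint and not a diagnosis.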

One more thing should be mentioned: we have /home NFS-mounted on top of
IBM's GPFS for some other purposes. Does that matter for Condor? Who knows.

I could use some advice on this. Anyone, please?

Regards,

Andrey

> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx 
> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Prashant Lal
> Sent: Monday, March 07, 2005 7:08 AM
> To: Condor-Users Mail List
> Subject: RE: [Condor-users] AMD Opteron Crashes
> 
> Hi Andrey
> 
> Please check the /etc/hosts file, just in case the entries are
> not defined there in this form:
> 
> 127.0.0.1 localhost.localdomain     localhost
> <IP>     machine.domain.com    machine
> 
> On Fri, 2005-03-04 at 21:09, Andrey Kaliazin wrote: 
> 
> 	Steffen,
> 	
> 	I have a suspicion that it could have something to do with LDAP
> 	authentication, not with AMD64, because we are trying to install
> 	Condor 6.6.8 on a Linux cluster running RH9 on dual-Xeon nodes and
> 	are getting similar crashes (SCHEDD ...died on signal 11) when it
> 	fails to identify the user, no matter whether we use NFS or not.
> 	
> 	Does anyone else successfully run Condor on systems where UIDs/GIDs
> 	are provided not by the passwd file but via LDAP?
> 	
> 	Andrey
> 	
> 	> -----Original Message-----
> 	> From: condor-users-bounces@xxxxxxxxxxx 
> 	> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of 
> 	> Steffen Prohaska
> 	> Sent: Friday, March 04, 2005 3:16 PM
> 	> To: Condor-Users Mail List
> 	> Subject: [Condor-users] AMD Opteron Crashes
> 	> 
> 	> Hi,
> 	> In
> 	> https://lists.cs.wisc.edu/archive/condor-users/pre-2004-June/msg01368.shtml
> 	> I read that it should be possible to use the
> 	> linux-x86-glibc23-dynamic binary on a 64-bit Opteron system to run
> 	> Condor.
> 	> 
> 	> Everything works fine until Condor tries to start a job. The
> 	> condor_starter crashes with a SEGFAULT.
> 	> 
> 	> I tried this with condor-6.6.8-linux-x86-glibc22-dynamic.tar.gz,
> 	> condor-6.6.8-linux-x86-glibc23-dynamic.tar.gz, and
> 	> condor-6.7.5-linux-x86-glibc23-dynamic.tar.gz. The behaviour is
> 	> always similar. We're running SUSE Linux Enterprise Server. User
> 	> information is stored in LDAP. I attached excerpts from log files
> 	> below. If more details would be helpful, I could also provide them.
> 	> 
> 	> Any thoughts on this? Is anyone successfully running Condor on a
> 	> similar Opteron system?
> 	> 
> 	> 	Steffen
> 	> 
> 	> 
> 	> --- System info
> 	> acorn:/ # cat /etc/SuSE-release
> 	> SUSE LINUX Enterprise Server 9 (x86_64)
> 	> VERSION = 9
> 	> acorn:/ # uname -a
> 	> Linux acorn 2.6.5-7.139-smp #1 SMP Fri Jan 14 15:41:33 UTC 2005 x86_64 x86_64 x86_64 GNU/Linux
> 	> 
> 	> --- From StartLog:
> 	> StartLog:3/4 15:48:32 Starter pid 18488 died on signal 11 (signal 11)
> 	> 
> 	> --- From /var/log/messages
> 	> Mar  4 15:48:32 acorn kernel: condor_starter[18488]: segfault at 00000000a4e0efc5 rip 00000000559a4dac rsp 00000000ffffc4a8 error 4
> 	> 
> 	> --- From StarterLog.vm2
> 	> 3/4 15:48:29 (fd:9) PASSWD_CACHE_REFRESH is undefined, using default value of 300
> 	> 3/4 15:48:29 (fd:9) Finding local host information, calling gethostname()
> 	> [...]
> 	> 3/4 15:48:29 (fd:9) passwd_cache::cache_uid(): getpwnam("condor") failed: user not found
> 	> 3/4 15:48:29 (fd:9) passwd_cache::cache_uid(): getpwnam("condor") failed: user not found
> 	> 3/4 15:48:29 (fd:9) PRIV_UNKNOWN --> PRIV_CONDOR at daemon_core_main.C:1382
> 	> 3/4 15:48:29 (fd:9) KEYCACHE: created: 82ca8d8
> 	> 3/4 15:48:29 (fd:9) ******************************************************
> 	> 3/4 15:48:29 (fd:9) ** condor_starter (CONDOR_STARTER) STARTING UP
> 	> 3/4 15:48:30 (fd:9) ** /vis/data/people/condor/linux-glibc23/sbin/condor_starter
> 	> 3/4 15:48:30 (fd:9) ** $CondorVersion: 6.6.8 Jan 27 2005 $
> 	> 3/4 15:48:30 (fd:9) ** $CondorPlatform: I386-LINUX_RH9 $
> 	> 3/4 15:48:30 (fd:9) ** PID = 18488
> 	> 3/4 15:48:30 (fd:9) ** Running as root: Privilege switching in effect
> 	> 3/4 15:48:30 (fd:9) ******************************************************
> 	> [...]
> 	> TransferSocket = "<130.73.68.82:21118>"
> 	> ShadowVersion = "$CondorVersion: 6.6.8 Jan 27 2005 $"
> 	> UidDomain = "zib.de"
> 	> 3/4 15:48:32 (fd:11) --- End of ClassAd ---
> 	> 3/4 15:48:32 (fd:11) STARTER_TIMEOUT_MULTIPLIER is undefined, using default value of 0
> 	> 3/4 15:48:32 (fd:11) New Daemon obj (shadow) name: "onyx3.zib.de", pool: "NULL", addr: "NULL"
> 	> 3/4 15:48:32 (fd:11) Version of Shadow is $CondorVersion: 6.6.8 Jan 27 2005 $
> 	> 3/4 15:48:32 (fd:11) Starter communicating with condor_shadow <130.73.68.82:21118>
> 	> 3/4 15:48:32 (fd:11) Submitting machine is "onyx3.zib.de"
> 	> 3/4 15:48:32 (fd:11) Doing CONDOR_register_starter_info
> 	> 3/4 15:48:32 (fd:11) ShouldTransferFiles is "NO", NOT transfering files
> 	> 3/4 15:48:32 (fd:11) Submit UidDomain: "zib.de"
> 	> 3/4 15:48:32 (fd:11)  Local UidDomain: "zib.de"
> 	> 3/4 15:48:32 (fd:11) Initialized user_priv as "..."
> 	> [ at this time the daemon crashes ]
> 	> --- End of log
> 	> 
> 	> --
> 	> Steffen Prohaska <prohaska@xxxxxx> <http://www.zib.de/prohaska/>
> 	> Zuse Institute Berlin, Takustraße 7, D-14195 Berlin-Dahlem, Germany
> 	> +49 (30) 841 85-337, fax -107
> 	> 1024D/DA749299 print 8B59 83A8 A43D E0E2 DEDB   D479 3157 2FEA DA74 9299
> 	> 
> 	
> 	
> 	_______________________________________________
> 	Condor-users mailing list
> 	Condor-users@xxxxxxxxxxx
> 	https://lists.cs.wisc.edu/mailman/listinfo/condor-users
> 
> Thanks and Regards
> 
> Prashant Lal
> Cadence Design Systems
> Noida Export Processing Zone,
> Noida - 201301,
> Phone:+91 120 2562842, extn 4009
> Fax:+91 120 2562231
> Cell:+91 98101-44168
> mailto:lalp@xxxxxxxxxxx