
Re: [Condor-users] starter process exits



I spent additional time tracing the startd process and discovered that
this error was of my own making.

Even after I fixed the error below by setting CONDOR_CONFIG explicitly, I
was still seeing starter processes exiting with code 44.  Strace revealed
they were becoming the wrong uid/gid before trying to write to
~/condor/log.  This caused permission denied errors and death.
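
For anyone who hits the same symptom, the mismatch can also be spotted
without strace by comparing the owner of the log directory with the uid
the account name currently resolves to.  What follows is only a rough
sketch in Python, not anything Condor ships: the helper name
owner_matches_account and its lookup parameter are made up for
illustration, and the path and account name are whatever applies on your
machines.

```python
import os
import pwd


def owner_matches_account(path, account, lookup=pwd.getpwnam):
    """Return True if `path` is owned by the uid that `account` resolves to.

    By default the lookup goes through the system name service (and so
    through nscd when it is running), which means the answer reflects
    whatever the system hands back at the moment of the call.
    `lookup` is a hypothetical hook so the check can be exercised
    without touching the real account database.
    """
    resolved_uid = lookup(account).pw_uid  # pwd.getpwnam raises KeyError if unknown
    owner_uid = os.stat(path).st_uid
    return resolved_uid == owner_uid


# Example call (path and account are placeholders for the local setup):
#   owner_matches_account("/path/to/condor/log", "condor")
```

If this returns False on an execute node, a daemon that switches to the
resolved uid will not own the directory it is about to write to.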

Once I noticed the incorrect uid/gid I was puzzled.  I define my users
across machines via LDAP, so I wasn't sure where the false info was
coming from.

Well, it seems that I had left a stray condor account definition in the
/etc/passwd file of my initrd images.  nscd was sometimes returning the
local /etc/passwd hit and at other times returning the LDAP posix
entry.  Needless to say, this led to unpredictable results.

An odd thing was that nscd would return the LDAP entry after I ran ls
/home/condor but would eventually fall back to the /etc/passwd entry.
This probably had to do with its cache management and ls somehow forcing
the uid-to-user lookup of the owner of /home/condor (the "correct" condor
account).
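
In case it helps someone else, the two checks I ended up doing by hand
can be sketched in Python roughly as follows: look for a stray entry in
a passwd-format file, and resolve the account several times to see
whether the answer changes.  The function names, the lookup hook, and
the default attempt count and delay are all my own choices for
illustration; pwd.getpwnam is the standard library call that goes
through the system name service.

```python
import pwd
import time


def local_passwd_ids(passwd_text, account):
    """Return the (uid, gid) pairs of every passwd-format line naming `account`."""
    ids = []
    for line in passwd_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        # passwd format: name:password:uid:gid:gecos:home:shell
        if len(fields) >= 4 and fields[0] == account:
            ids.append((int(fields[2]), int(fields[3])))
    return ids


def sample_lookups(account, attempts=5, delay=1.0, lookup=pwd.getpwnam):
    """Resolve `account` repeatedly and return the distinct (uid, gid) answers.

    More than one element in the result means the name service gave
    different answers over time. `lookup` is a hypothetical hook for testing.
    """
    seen = set()
    for i in range(attempts):
        entry = lookup(account)
        seen.add((entry.pw_uid, entry.pw_gid))
        if i < attempts - 1:
            time.sleep(delay)
    return seen


# Example (run on an execute node):
#   with open("/etc/passwd") as f:
#       print(local_passwd_ids(f.read(), "condor"))
#   print(sample_lookups("condor"))
```

A non-empty result from the first function on a machine that should get
the account only from LDAP, or more than one pair from the second, would
point at the same kind of conflict described above.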

Anyway, thanks for the response and sorry I didn't catch this before
posting.  I wanted to add the extra details to round out the discussion
and show that it really wasn't mysterious in the end.

~jpr

On Wed, 19 Jan 2005, Tim Robertson wrote:

> > The processes are
> > started as root but run as the user condor. The execute nodes get their
> > /home/condor served up by NFS and the dirs are auto-mounted.  My global
> > condor_config is in /opt/condor/etc/condor_config and there is a symlink
> > from /home/condor/condor_config to this file.
> 
> It sounds like you may have a problem in your config setup -- you 
> realize that there are two config files, right?
> 
> The main config file is in /opt/condor/etc, and the other in 
> /opt/condor/local.X, where X is the name of your machine.  You're 
> supposed to set the CONDOR_CONFIG environment variable to point to the 
> location of the main config file (the one in /opt/condor/etc), which, 
> in turn, has a macro pointing to the local config file.  By default, 
> you need both for condor to start properly.
> 
> > I'm seeing some strange behavior both when I start up condor_master and
> > when I submit jobs to the pool.  In the case of condor_master, if I start
> > this process without first doing an 'ls /home/condor' it dies with a
> > complaint about not having CONDOR_CONFIG set, not being able to find
> > /etc/condor/condor_config, or not being able to find
> > /local/condor/condor_config.  The complaint also mentions not finding
> > ~/condor.  When I trace the condor_master with strace, however, it doesn't
> > look like an open() attempt is ever made on ~/condor_config.  Even though
> > df shows /home/condor as already mounted, if I 'ls /home/condor', however,
> > it succeeds in checking for and finding this directory.  It seems there is
> > some reason condor is not even attempting to open
> > /home/condor/condor_config.
> 
> I don't know why condor would start properly once you've listed the 
> /home/condor directory.  If your CONDOR_CONFIG variable is set 
> correctly, and your local config file is in the right place, condor 
> should start.
> 
> If you have everything correctly set up, perhaps it's a problem with 
> your NFS automounting configuration -- I've seen similar problems 
> before, but never on my own systems.
> 
> Best,
> Tim