[Condor-users] starter process exits



Hi,

I've set up a condor pool (6.6.7) on our desktop x86 boxes with an FC3
master and SuSE 9.1 execute nodes (kernel 2.6.4).  The systems share an NFS
file system and have a common uid_domain. I'm able to start the condor
processes (already fixed the /proc/meminfo problem).  The processes are
started as root but run as the user condor. The execute nodes get their
/home/condor served up by NFS and the dirs are auto-mounted.  My global 
condor_config is in /opt/condor/etc/condor_config and there is a symlink 
from /home/condor/condor_config to this file.

I'm seeing some strange behavior both when I start up condor_master and
when I submit jobs to the pool.  In the case of condor_master, if I start
this process without first doing an 'ls /home/condor' it dies with a
complaint about not having CONDOR_CONFIG set, not being able to find
/etc/condor/condor_config, or not being able to find
/local/condor/condor_config.  The complaint also mentions not finding
~/condor.  When I trace condor_master with strace, however, it doesn't
look like an open() attempt is ever made on ~/condor_config.  Even though
df shows /home/condor as already mounted, condor_master only succeeds in
checking for and finding this directory if I first run 'ls /home/condor'.
Otherwise, for some reason, condor does not even attempt to open
/home/condor/condor_config.

This trouble follows me to the startd process.  If I submit a job and
monitor the StartLog, it frequently shows that the starter exited with
status 1.  If I run strace on startd, I find that the child is dying for
the same reason mentioned above.  Again, no attempt is even made to check
/home/condor/condor_config; it just tries to open the first two.  So
condor does not seem to check for this file at all.

I can "fix" this problem by either doing the ls or setting CONDOR_CONFIG 
explicitly. 
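
To make the failing case easier to reproduce, here is a minimal Python
sketch of a check that could be run on an execute node before starting
condor_master.  It is only a diagnostic aid and not how condor itself
locates its configuration: the list of locations is taken from the error
message described above, and the helper names candidate_configs() and
first_readable() are made up for illustration.  The home directory is
looked up through the password database for the user condor.

```python
import os
import pwd


def candidate_configs(env, condor_home):
    """List config locations in the order the error message names them.

    Assumption: this mirrors the complaint quoted in the post, not the
    actual search logic inside condor.
    """
    paths = []
    if env.get("CONDOR_CONFIG"):
        paths.append(env["CONDOR_CONFIG"])
    paths.append("/etc/condor/condor_config")
    paths.append("/local/condor/condor_config")
    if condor_home:
        paths.append(os.path.join(condor_home, "condor_config"))
    return paths


def first_readable(paths):
    """Return the first path the current user can read, or None."""
    for path in paths:
        # os.access follows symlinks, so a link to
        # /opt/condor/etc/condor_config is checked at its target.
        if os.access(path, os.R_OK):
            return path
    return None


if __name__ == "__main__":
    try:
        home = pwd.getpwnam("condor").pw_dir
    except KeyError:
        home = None  # no 'condor' user on this machine
    paths = candidate_configs(os.environ, home)
    for path in paths:
        state = "readable" if os.access(path, os.R_OK) else "not readable"
        print(path, state)
    print("first readable:", first_readable(paths))
```

Keep in mind that touching the path from this script may itself trigger
the automounter, much as the ls does, so it is most informative when run
right after boot and compared against a run with CONDOR_CONFIG set to
/opt/condor/etc/condor_config.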

Does anyone have insights into this?