
Re: [Condor-users] condor_schedd running under the wrong user



On Jan 31, 2007, at 8:08 AM, Jamie Rollins wrote:

Hi, folks. I have a question that I hope someone may have some insight into.

The other day, for a reason that is unknown to me, the condor_schedd daemon running on the central manager stopped running as the user "condor", which I believe it had been doing previously (and which all the other daemons are running as), and started running instead as a different user (user "x", say, who uses the pool frequently). This caused the condor_schedd daemon to freeze, presumably because it couldn't write to any of its log/execute/spool files, which are only writable by the "condor" user. Although this issue seems to coincide with an upgrade of the domain/LDAP controller and NFS home directory server, I'm not convinced that they're related (everything else seems to be working OK).

I was able to make the problem go away for a bit by killing all the daemons and flushing the spool directory for the central manager, then restarting the daemons. After that the schedd daemon started up as "condor". However, after a while, and some more use by user "x", the schedd daemon mysteriously started running as user "x" again. I can't find anything in the logs that would indicate how, when, or why this change may have happened.

Parenthetically, I'm having trouble figuring out how the daemons determine what user to run as to begin with. The condor_master is started as root, but then immediately starts running as user "condor". All the sub-daemons run as "condor" (collector, negotiator, startd), except the schedd, which mysteriously runs as user "x" (unless the spool directory has been cleared). Nowhere in the configuration files do I specify that the daemons run as user "condor".

I found a mail to this list from last May where a user ('rok') describes what appears to be a very similar problem (see attached message below). Unfortunately there weren't any replies. Has anyone else ever experienced anything like this? Rok, did you ever get the issue resolved, or figure out what was causing it? Any thoughts at all would be very much appreciated.

The schedd starts life as root, then switches its effective uid to 'condor' for most of its life. It switches to users' uids temporarily to perform actions as the users (accessing job files, starting scheduler universe jobs, etc.). What's probably happening is that the schedd is freezing in the middle of one of these operations, so it stays stuck with that user's uid. Problems talking to the NFS server could easily cause this.
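
One way to check whether a daemon is in this state is to compare its real and effective uids. The following is only a rough sketch, assuming a Linux host where /proc/<pid>/status has a "Uid:" line listing the real, effective, saved, and filesystem uids in that order. The helper names (parse_uids, describe, read_status) are made up for illustration and are not part of Condor.

```python
import os
import sys


def parse_uids(status_text):
    """Return (real, effective, saved, filesystem) uids from the text
    of a Linux /proc/<pid>/status file."""
    for line in status_text.splitlines():
        if line.startswith("Uid:"):
            fields = line.split()[1:5]
            return tuple(int(f) for f in fields)
    raise ValueError("no Uid: line found")


def describe(uids):
    """Summarize the uid state of a process in one sentence."""
    real, effective, _saved, _fs = uids
    if real == 0 and effective != 0:
        # Started as root, effective uid switched to another account.
        return "started as root, currently acting as uid %d" % effective
    if real == effective:
        return "running entirely as uid %d" % real
    return "real uid %d, effective uid %d" % (real, effective)


def read_status(pid):
    """Read /proc/<pid>/status (Linux only; an assumption of this sketch)."""
    with open("/proc/%d/status" % pid) as f:
        return f.read()


if __name__ == "__main__":
    # Pass the pid of condor_schedd; defaults to this process for a dry run.
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    print(describe(parse_uids(read_status(pid))))
```

If the schedd reports real uid 0 with user "x" as its effective uid, that would be consistent with a root-started daemon stuck partway through acting on behalf of "x", rather than a daemon that was launched as "x".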

Could you set the following in your Condor config file and then send us the end of the schedd log the next time this happens:
SCHEDD_DEBUG = D_FULLDEBUG D_COMMAND

+--------------------------------+-----------------------------------+
|           Jaime Frey           | I used to be a heavy gambler.     |
|       jfrey@xxxxxxxxxxx        | But now I just make mental bets.  |
| http://www.cs.wisc.edu/~jfrey/ | That's how I lost my mind.        |
+--------------------------------+-----------------------------------+