
[Condor-users] SCHEDD daemon owned by a user



Hi,

For some reason, condor_schedd is now owned by an ordinary user rather than
by condor on 5 of the 8 systems that allow our users to submit jobs.  This
started happening about two weeks ago and is obviously causing havoc on
those nodes: job submission and condor_q no longer work there.  We've been
using Condor 7.2.1 for a couple of months, and until this started
everything was running just fine.

The 'condor_off' command doesn't seem to do anything on the affected nodes.
Even after killing the condor daemons (condor_master, condor_schedd,
condor_procd) with 'kill -9 <pid>' and restarting with 'service condor start',
condor_schedd *still* shows up as owned by the same user.  Even a reboot
of the system doesn't clear the problem.
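
For anyone who wants to run the same ownership check on several nodes, here
is a minimal sketch in Python 3.7+ (not something from our setup).  It
assumes a Linux procps 'ps' that accepts the 'ruser', 'euser', 'pid' and
'comm' output fields, and it assumes that "condor" and "root" are the only
owners you expect to see.  The function names and the expected-owner list
are illustrative only.  Listing both the real and the effective user may
help show whether the daemon was started as the wrong user or has only
switched its effective ID.

```python
import subprocess

# Daemons named in this post; adjust for nodes that run other daemons.
CONDOR_DAEMONS = {"condor_master", "condor_schedd", "condor_procd"}


def parse_daemon_owners(ps_output):
    """Parse 'ps -eo ruser,euser,pid,comm' output.

    Returns a list of (command, pid, real_user, effective_user) tuples
    for the Condor daemons found in the listing.
    """
    owners = []
    for line in ps_output.splitlines():
        fields = line.split()
        # Skip the header line and anything that is not a 4-column row.
        if len(fields) != 4 or not fields[2].isdigit():
            continue
        ruser, euser, pid, comm = fields
        if comm in CONDOR_DAEMONS:
            owners.append((comm, int(pid), ruser, euser))
    return owners


def unexpected_owners(owners, expected=("condor", "root")):
    """Return entries whose effective user is not in the expected list.

    The default list is an assumption, not something Condor guarantees.
    """
    return [entry for entry in owners if entry[3] not in expected]


if __name__ == "__main__":
    listing = subprocess.run(
        ["ps", "-eo", "ruser,euser,pid,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
    for comm, pid, ruser, euser in unexpected_owners(parse_daemon_owners(listing)):
        print(f"{comm} (pid {pid}): real user {ruser}, effective user {euser}")
```

Run on each submission node, it prints one line per Condor daemon whose
effective user is neither condor nor root, and nothing on a healthy node.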

On the other 3 submission nodes, everything is still behaving properly.
I even tried stopping and restarting condor on one of those nodes, and
condor_schedd was still happily owned by condor.

Is this a bug in 7.2.1?
Any ideas on how to fix or where to look for the problem?

Thanks!
- Larry


Further info:
Our environment is simple and straightforward.

Master node:
	RedHat RHEL_3, fully patched, runs COLLECTOR, MASTER, SCHEDD,
	and NEGOTIATOR

Job submission nodes:
	8 identically configured Scientific Linux 4.7 rack-mount systems,
	fully patched, running MASTER and SCHEDD.  Users log in to these
	systems to submit jobs to the computation nodes.

Computation nodes:
	16 identically configured Scientific Linux 4.7 rack-mount systems,
	fully patched, running MASTER and STARTD

All the condor systems communicate on a private, non-routable LAN, and we
use NIS for passwords and NFS for file sharing.  The local config file on
all nodes defers to the master config file except for the IP address and
which daemons to run.

The master config file is pretty much out-of-the-box except we're using
TESTINGMODE instead of UWCS.

The same user owns condor_schedd on all 5 affected submission nodes.  That
user does *not* have root permissions, and can still submit jobs
successfully on the 3 nodes that are not affected.

--
P. Larry Nelson (217-244-9855) | Systems/Network Administrator
461 Loomis Lab                 | High Energy Physics Group
1110 W. Green St., Urbana, IL  | Physics Dept., Univ. of Ill.
MailTo:lnelson@xxxxxxxx        | http://www.roadkill.com/lnelson/
-------------------------------------------------------------------
 "Information without accountability is just noise."  - P.L. Nelson