Re: [Condor-users] SCHEDD daemon owned by a user
- Date: Thu, 09 Jul 2009 16:03:50 -0500
- From: "P. Larry Nelson" <lnelson@xxxxxxxx>
- Subject: Re: [Condor-users] SCHEDD daemon owned by a user
That nailed the problem.
I didn't bother with the gdb - I just went ahead and deleted the old
job_queue.log, did a 'touch job_queue.log', made sure ownership
and file permissions were correct, killed and restarted condor,
and all is fine again!
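For anyone repeating this, the steps above can be sketched roughly as the shell function below. This is an illustration, not a tested Condor procedure: the function name reset_job_queue and the ".bad.<timestamp>" backup suffix are made up for this example, and it moves the old file aside rather than deleting it as I did. Replacing job_queue.log throws away whatever jobs were still queued on that schedd, so stop condor first.

```shell
# Hypothetical helper: replace job_queue.log in a spool directory with an
# empty file, keeping the old one under a timestamped name.
# $1 = spool directory, $2 = optional owner to chown the new file to.
reset_job_queue() {
    spool_dir=$1
    owner=${2:-}
    log="$spool_dir/job_queue.log"

    [ -d "$spool_dir" ] || { echo "not a directory: $spool_dir" >&2; return 1; }

    # Keep the old queue log for later inspection instead of deleting it.
    if [ -e "$log" ]; then
        mv "$log" "$log.bad.$(date +%Y%m%d%H%M%S)" || return 1
    fi

    # Same effect as the 'touch job_queue.log' step: an empty queue log.
    touch "$log" || return 1

    # Ownership is changed only when an owner is given (assumption: the
    # right owner on your site is whatever account condor normally runs as).
    if [ -n "$owner" ]; then
        chown "$owner" "$log" || return 1
    fi
}
```

With condor stopped, this could be called as root like so: reset_job_queue "$(condor_config_val SPOOL)" condor:condor. The condor:condor owner is an assumption about the local setup; file permissions still need to be checked by hand before restarting condor.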
Dan Bradley wrote on 7/9/2009 3:27 PM:
Perhaps the schedd is trying to access a file (like a "user log" file)
which requires that it run temporarily as the user. If access to the
file in question blocks the schedd (e.g. because of filesystem issues),
then the schedd could get stuck in this state.
To find out if that is the case, attach to the schedd with gdb and see
what the call stack looks like. Example:
gdb -p <pid of schedd>
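If an interactive gdb session is inconvenient, the same information can be collected non-interactively. The sketch below is only an illustration: the function name schedd_backtrace is made up, it assumes pgrep and gdb are installed, and attaching to the schedd will normally require root.

```shell
# Hypothetical helper: attach gdb to the oldest process with the given
# name (default condor_schedd), print a backtrace of every thread, detach.
schedd_backtrace() {
    name=${1:-condor_schedd}

    # -x matches the process name exactly, -o picks the oldest match.
    pid=$(pgrep -o -x "$name") || {
        echo "no process named $name found" >&2
        return 1
    }

    # -batch makes gdb exit after running the -ex command.
    gdb -p "$pid" -batch -ex "thread apply all bt"
}
```

A stack that ends in a file operation such as open or write would be consistent with the schedd being blocked on a file.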
I have heard of cases where this sort of situation required stopping the
schedd and hand-editing the schedd's job_queue.log (which lives in the
condor spool directory) to remove references to a problematic file.
P. Larry Nelson wrote:
For some reason, condor_schedd is now owned by a user rather than condor
on 5 of our 8 systems that allow job submission by our users. This
started happening a couple of weeks ago, and is obviously causing
havoc on those nodes, preventing job submissions and condor_q from
working. We've been using condor 7.2.1 for a couple of months now, and prior
to this everything was running just fine.
The command 'condor_off' doesn't seem to do anything on the affected nodes,
so even when killing off the condor daemons (condor_master, condor_schedd,
condor_procd) with a 'kill -9 <pid>' and restarting with 'service condor start',
the condor_schedd *still* shows up being owned by the same user. Even a
reboot of the system doesn't seem to clear the problem.
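For reference, the ownership being described here can be checked with ps. The sketch below is one way to do it; the function name proc_owners is made up for illustration. It prints the effective user next to the real user, which helps tell a daemon that has only switched its effective identity from one that was started under the wrong account.

```shell
# Hypothetical helper: print "pid effective-user real-user command" for
# one process ID. Each -o field ends in '=' to suppress the header line.
proc_owners() {
    ps -o pid= -o user= -o ruser= -o comm= -p "$1"
}
```

For example: proc_owners "$(pgrep -o -x condor_schedd)". The pgrep call assumes the daemon's process name is exactly condor_schedd.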
On the other 3 submission nodes, everything is still behaving properly.
I even tried stopping and restarting condor on one of those nodes and
condor_schedd is still happily owned by condor.
Is this a bug in 7.2.1?
Any ideas on how to fix or where to look for the problem?
Our environment is simple and straightforward.
RedHat RHEL_3, fully patched, runs COLLECTOR, MASTER, SCHEDD,
Job submission nodes:
8 identically configured Scientific Linux 4.7 rack-mount systems,
fully patched, running MASTER and SCHEDD. Users log in to these
systems to submit jobs to the computation nodes.
16 identically configured Scientific Linux 4.7 rack-mount systems,
fully patched, running MASTER and STARTD
All the condor systems communicate on a private, non-routable LAN, and we
use NIS for passwords and NFS for file sharing. The local config file on
all nodes defers to the master config file except for the IP address and
which daemons to run.
The master config file is pretty much out-of-the-box except we're using
TESTINGMODE instead of UWCS.
The user who owns condor_schedd on 5 of the submission nodes does *not* have
root permissions. The same user owns the condor_schedd on the 5 affected nodes.
The same user can successfully submit jobs on the nodes not affected.
P. Larry Nelson (217-244-9855) | Systems/Network Administrator
461 Loomis Lab | High Energy Physics Group
1110 W. Green St., Urbana, IL | Physics Dept., Univ. of Ill.
MailTo:lnelson@xxxxxxxx | http://www.roadkill.com/lnelson/
"Information without accountability is just noise." - P.L. Nelson