Re: [Condor-users] SCHEDD daemon owned by a user



Thanks Dan!!
That nailed the problem.

I didn't bother with the gdb - I just went ahead and deleted the old
job_queue.log, did a 'touch job_queue.log', made sure ownership
and file permissions were correct, killed and restarted condor,
and all is fine again!
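
For anyone repeating this, the steps above can be sketched as a small shell function. This is only a sketch under assumptions: the function name, the default owner "condor", and the default mode 600 are illustrative choices (compare against a healthy submit node), and it moves the old log aside rather than deleting it as described above. Replacing job_queue.log discards whatever jobs were recorded in it, so condor should be stopped first.

```shell
# Hypothetical helper (not part of Condor): replace the schedd's
# job_queue.log with an empty file.
# Usage: reset_job_queue <spool_dir> [owner] [mode]
reset_job_queue() {
    spool_dir=$1
    owner=${2:-condor}   # assumed owner; check a working node
    mode=${3:-600}       # assumed mode; check a working node
    log="$spool_dir/job_queue.log"

    # Keep the old log under a timestamped name instead of deleting it
    # (a change from the steps in this message, so it can be inspected later).
    if [ -e "$log" ]; then
        mv "$log" "$log.bad.$(date +%Y%m%d%H%M%S)" || return 1
    fi

    # Create the empty replacement and set ownership and permissions.
    touch "$log" || return 1
    chown "$owner" "$log" || return 1
    chmod "$mode" "$log"
}

# Example, run as root with the condor daemons stopped:
#   reset_job_queue "$(condor_config_val SPOOL)" condor
#   service condor start
```

condor_config_val SPOOL prints the configured spool directory, which avoids hard-coding the path.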

- Larry

Dan Bradley wrote on 7/9/2009 3:27 PM:
Perhaps the schedd is trying to access a file (like a "user log" file) which requires that it run temporarily as the user. If access to the file in question blocks the schedd (e.g. because of filesystem issues), then the schedd could get stuck in this state.

To find out if that is the case, attach to the schedd with gdb and see what the call stack looks like. Example:

gdb -p <pid of schedd>
(gdb) where
(gdb) quit
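
The same check can be run non-interactively, which keeps the schedd paused for as short a time as possible. A minimal sketch, assuming gdb is installed and you have permission to attach (typically root); the function name dump_stack is made up for illustration:

```shell
# Hypothetical helper: print the call stack of a running process, then detach.
# Usage: dump_stack <pid>
dump_stack() {
    # -batch makes gdb exit after running the -ex commands;
    # "where" is the same command as in the interactive session.
    gdb -p "$1" -batch -ex where
}

# Example, assuming exactly one condor_schedd is running on the node:
#   dump_stack "$(pgrep -x condor_schedd)"
```

If the stack shows the process sitting in an open or read call on a file, that points at the blocked file access described above.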

I have heard of cases where this sort of situation required stopping the schedd and hand-editing the schedd's job_queue.log (which lives in the condor spool directory) to remove references to a problematic file.

--Dan

P. Larry Nelson wrote:
Hi,

For some reason, condor_schedd is now owned by a user rather than condor
on 5 of our 8 systems that allow job submission by our users.  This just
started happening a couple of weeks ago, and is obviously causing
havoc on those nodes, preventing job submissions and condor_q from
working.  We've been using condor 7.2.1 now for a couple months and prior
to this everything was running just fine.

The command 'condor_off' doesn't seem to do anything on the affected nodes,
so even when killing off the condor daemons (condor_master, condor_schedd,
condor_procd) with a 'kill -9 <pid>' and restarting with 'service condor start',
the condor_schedd *still* shows up being owned by the same user.  Even a
reboot of the system doesn't seem to clear the problem.

On the other 3 submission nodes, everything is still behaving properly.
I even tried stopping and restarting condor on one of those nodes and
condor_schedd is still happily owned by condor.
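
A quick way to see which user a daemon is running as on a given node is to ask ps for the user and command name of every process. A minimal sketch; the function name proc_owner is made up for illustration, and ps may print a numeric UID instead of a name for long usernames:

```shell
# Hypothetical helper: print the user(s) that processes with the given
# command name are running as, one per line.
# Usage: proc_owner <command_name>
proc_owner() {
    # Separate -o options with an empty header ("=") suppress the header
    # line; awk keeps only rows whose command name matches exactly.
    ps -e -o user= -o comm= | awk -v name="$1" '$2 == name { print $1 }'
}

# Example: on a healthy node this is expected to print "condor"
#   proc_owner condor_schedd
```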

Is this a bug in 7.2.1?
Any ideas on how to fix or where to look for the problem?

Thanks!
- Larry


Further info:
Our environment is simple and straightforward.

Master node:
	RedHat RHEL_3, fully patched, runs COLLECTOR, MASTER, SCHEDD,
	and NEGOTIATOR

Job submission nodes:
	8 identically configured Scientific Linux 4.7 rack-mount systems,
	fully patched, running MASTER and SCHEDD.  Users log in to these
	systems to submit jobs to the computation nodes.

Computation nodes:
	16 identically configured Scientific Linux 4.7 rack-mount systems,
	fully patched, running MASTER and STARTD

All the condor systems communicate on a private, non-routable LAN, and we
use NIS for passwords and NFS for file sharing.  The local config file on
all nodes defers to the master config file except for the IP address and
which daemons to run.

The master config file is pretty much out-of-the-box except we're using
TESTINGMODE instead of UWCS.

The user who owns condor_schedd on 5 of the submission nodes does *not* have
root permissions.  The same user owns the condor_schedd on the 5 affected nodes.
The same user can successfully submit jobs on the nodes not affected.


--
P. Larry Nelson (217-244-9855) | Systems/Network Administrator
461 Loomis Lab                 | High Energy Physics Group
1110 W. Green St., Urbana, IL  | Physics Dept., Univ. of Ill.
MailTo:lnelson@xxxxxxxx        | http://www.roadkill.com/lnelson/
-------------------------------------------------------------------
 "Information without accountability is just noise."  - P.L. Nelson