[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Condor-users] strange schedd problem



Thanks, Erik!

Erik Paulson wrote:
On 11/11/07, Maxim kovgan <maxim.kvg@xxxxxxxxx> wrote:
Hello, list.

We are using "stock" condor 6.8.6 version.

2 central managers are in HAD couple,
one is SL3, other Debian Etch. each running the correct version for its
arch, and distribution.

We have 2 schedd's.
One is used for production, another is only a backup.

We are currently experiencing the following problem:

the main schedd has started running under the user of the user that uses
the pool the most, i.e. has the highest value of "Effective Priority"

<...>

This doesn't have anything to do with effective priority, that's just
a coincidence.


OK.


The interesting thing is that in the mailing list archives I've found a
very similar problem to mine:
https://lists.cs.wisc.edu/archive/condor-users/2007-January/msg00201.shtml

Unfortunately it refers to a yet another similar report, and without a
satisfying solution:

The "solution" was:
"flushing the spool directory for the central manger"


Is there a solution for such symptom, except deleting the spool ?



Do you have NFS involved somewhere? The schedd will temporarily switch
to a different user to write log files or copy data around, if
something blocks it, it could get stuck waiting to finish that
operation. Can you shut the schedd down? If you start it back up, does
the problem go away?

yes, we have NFS, but not in /var/condor/spool or such. user's files are on NFS.
can I shut schedd ? yes.
if I start it back the problem doesn't go away. no.


You need to tell us more about how you're observing this. Are you
looking at top output, or ps, or the log files? How long has it been
happening for now?

for sev. days, the symptoms started as some jobs accessing WRONG home directory files of OTHER users.


What does
ps axo cmd,pid,ppid,args,euid,ruid,fuid

print out?


1. ps axo cmd,pid,ppid,args,euid,ruid,fuid | less, cut off relevant parts:

-bash 13734 13730 -bash 0 0 0 /usr/local/condor/debian/co 13839 13734 /usr/local/condor/debian/co 40137 0 40137 <--- this was launched manually, also became other user. condor_master 13844 1 condor_master 10008 0 10008 condor_schedd -f 13845 13844 condor_schedd -f 40137 0 40137


2. tried to launch the schedd manually, via debian tool:
start-stop-daemon -u condor -S --exec /usr/local/condor/debian/condor-6.8.6/sbin/condor_schedd -- -f &

via top have seen the process keep the PID, but change the UID

This is wierd.

3. SchedLog shows not much:
11/12 03:00:21 ******************************************************
11/12 03:00:21 ** condor_schedd (CONDOR_SCHEDD) STARTING UP
11/12 03:00:21 ** /usr/local/condor/debian/condor-6.8.6/sbin/condor_schedd
11/12 03:00:21 ** $CondorVersion: 6.8.6 Sep 13 2007 $
11/12 03:00:21 ** $CondorPlatform: I386-LINUX_DEBIAN40 $
11/12 03:00:21 ** PID = 13882
11/12 03:00:21 ** Log last touched 11/12 02:52:44
11/12 03:00:21 ******************************************************
11/12 03:00:21 Using config source: /usr/local/condor/etc/condor_config
11/12 03:00:21 Using local config sources:
11/12 03:00:21    /usr/local/condor/etc/ds-sub0.local
11/12 03:00:21 DaemonCore: Command Socket at <xxx.yyy.zzz.229:11773>
11/12 03:00:21 History file rotation is enabled.
11/12 03:00:21   Maximum history file size is: 20971520 bytes
11/12 03:00:21   Number of rotated history files is: 2
===== here it stalls



4. Latest entries into ShadowLog:
============ ShadowLog snapshot start ================
11/11 23:18:31 (10084.265) (11588): condor_read(): timeout reading 5 bytes from <xxx.yyy.zzz.229:10651>.
11/11 23:18:31 (10084.265) (11588): AUTHENTICATE: handshake failed!
11/11 23:18:31 (10084.265) (11588): Authentication Error: AUTHENTICATE:1002:Failure performing handshake
11/11 23:18:31 (10084.265) (11588): Failed to update job queue!
11/11 23:18:31 (10084.265) (11588): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 107 11/11 23:19:04 (10084.297) (11809): FileLock::obtain(1) failed - errno 5 (Input/output error) 11/11 23:19:31 (10084.259) (11520): condor_read(): timeout reading 5 bytes from <xxx.yyy.zzz.229:10651>.
11/11 23:19:31 (10084.259) (11520): AUTHENTICATE: handshake failed!
11/11 23:19:31 (10084.259) (11520): Authentication Error: AUTHENTICATE:1002:Failure performing handshake
11/11 23:19:31 (10084.259) (11520): Failed to update job queue!
11/11 23:19:31 (10084.259) (11520): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 107 11/11 23:20:04 (10084.282) (11624): FileLock::obtain(1) failed - errno 5 (Input/output error) 11/11 23:20:34 (10084.191) (10959): condor_read(): timeout reading 5 bytes from <xxx.yyy.zzz.229:10651>.
11/11 23:20:34 (10084.191) (10959): AUTHENTICATE: handshake failed!
11/11 23:20:34 (10084.191) (10959): Authentication Error: AUTHENTICATE:1002:Failure performing handshake
11/11 23:20:34 (10084.191) (10959): Failed to update job queue!
11/11 23:20:34 (10084.191) (10959): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 107 11/11 23:21:04 (10084.280) (11604): FileLock::obtain(1) failed - errno 5 (Input/output error) 11/11 23:21:31 (10084.197) (10794): condor_read(): timeout reading 5 bytes from <xxx.yyy.zzz.229:10651>.
11/11 23:21:31 (10084.197) (10794): AUTHENTICATE: handshake failed!
11/11 23:21:31 (10084.197) (10794): Authentication Error: AUTHENTICATE:1002:Failure performing handshake
11/11 23:21:31 (10084.197) (10794): Failed to update job queue!
11/11 23:21:31 (10084.197) (10794): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 107 11/11 23:22:04 (10084.266) (11504): FileLock::obtain(1) failed - errno 5 (Input/output error) 11/11 23:23:04 (10084.279) (11625): FileLock::obtain(1) failed - errno 5 (Input/output error) 11/11 23:23:40 (10084.268) (11481): FileLock::obtain(1) failed - errno 5 (Input/output error) 11/11 23:24:40 (10084.287) (11736): FileLock::obtain(1) failed - errno 5 (Input/output error)
(END)
============== ShadowLog snapshot end ==================================

This only confirms the schedd was stuck on wrong UID for a while quite now. if I chmod the spool/log files to the wrong user, the schedd will start "working".


============== MasterLog snapshot start ================================
11/12 02:52:44 ******************************************************
11/12 02:52:44 ** condor_master (CONDOR_MASTER) STARTING UP
11/12 02:52:44 ** /usr/local/condor/debian/condor-6.8.6/sbin/condor_master
11/12 02:52:44 ** $CondorVersion: 6.8.6 Sep 13 2007 $
11/12 02:52:44 ** $CondorPlatform: I386-LINUX_DEBIAN40 $
11/12 02:52:44 ** PID = 13844
11/12 02:52:44 ** Log last touched 11/12 02:48:20
11/12 02:52:44 ******************************************************
11/12 02:52:44 Using config source: /usr/local/condor/etc/condor_config
11/12 02:52:44 Using local config sources:
11/12 02:52:44    /usr/local/condor/etc/ds-sub0.local
11/12 02:52:44 Running as root.  Enabling specialized core dump routines
11/12 02:52:44 Attempting to lock /var/condor/log/InstanceLock.
11/12 02:52:44 Obtained lock on /var/condor/log/InstanceLock.
11/12 02:52:44 DaemonCore: Command Socket at <xxx.yyy.zzz.229:12447>
11/12 02:52:44 Will use UDP to update collector master1.mydomain <xxx.yyy.zzz.235:9618> 11/12 02:52:44 Will use UDP to update collector master0.mydomain <xxx.yyy.zzz.231:9618>
11/12 02:52:44 ::RealStart; SCHEDD on_hold=0
11/12 02:52:44 Started DaemonCore process "/usr/local/condor/debian/condor-6.8.6/sbin/condor_schedd", pid and pgroup = 13845
11/12 02:52:44 Getting monitoring info for pid 13844
11/12 02:52:49 enter Daemons::UpdateCollector
11/12 02:52:49 Trying to update collector <xxx.yyy.zzz.235:9618>
11/12 02:52:49 Attempting to send update via UDP to collector master1.mydomain <xxx.yyy.zzz.235:9618>
11/12 02:52:49 Trying to update collector <xxx.yyy.zzz.231:9618>
11/12 02:52:49 Attempting to send update via UDP to collector master0.mydomain <xxx.yyy.zzz.231:9618>
11/12 02:52:49 exit Daemons::UpdateCollector
11/12 02:52:49 enter Daemons::CheckForNewExecutable
11/12 02:52:49 Time stamp of running /usr/local/condor/debian/condor-6.8.6/sbin/condor_master: 1189673623
11/12 02:52:49 GetTimeStamp returned: 1189673623
11/12 02:52:49 Time stamp of running /usr/local/condor/debian/condor-6.8.6/sbin/condor_schedd: 1189673601
11/12 02:52:49 GetTimeStamp returned: 1189673601
11/12 02:52:49 exit Daemons::CheckForNewExecutable
11/12 02:52:59 ProcAPI::buildFamily() Found daddypid on the system: 13845
11/12 02:53:59 ProcAPI::buildFamily() Found daddypid on the system: 13845
11/12 02:54:59 ProcAPI::buildFamily() Found daddypid on the system: 13845
11/12 02:55:59 ProcAPI::buildFamily() Found daddypid on the system: 13845
11/12 02:56:44 Getting monitoring info for pid 13844
11/12 02:56:59 ProcAPI::buildFamily() Found daddypid on the system: 13845
11/12 02:57:49 enter Daemons::UpdateCollector
11/12 02:57:49 Trying to update collector <xxx.yyy.zzz.235:9618>
11/12 02:57:49 Attempting to send update via UDP to collector master1.mydomain <xxx.yyy.zzz.235:9618>
11/12 02:57:49 Trying to update collector <xxx.yyy.zzz.231:9618>
11/12 02:57:49 Attempting to send update via UDP to collector master0.mydomain <xxx.yyy.zzz.231:9618>
11/12 02:57:49 exit Daemons::UpdateCollector
11/12 02:57:49 enter Daemons::CheckForNewExecutable
11/12 02:57:49 Time stamp of running /usr/local/condor/debian/condor-6.8.6/sbin/condor_master: 1189673623
11/12 02:57:49 GetTimeStamp returned: 1189673623
11/12 02:57:49 Time stamp of running /usr/local/condor/debian/condor-6.8.6/sbin/condor_schedd: 1189673601
11/12 02:57:49 GetTimeStamp returned: 1189673601
11/12 02:57:49 exit Daemons::CheckForNewExecutable
11/12 02:57:59 ProcAPI::buildFamily() Found daddypid on the system: 13845
11/12 02:58:59 ProcAPI::buildFamily() Found daddypid on the system: 13845
11/12 02:59:59 ProcAPI::buildFamily() Found daddypid on the system: 13845

============== MasterLog snapshot end  ================================


I have another (backup) schedd, and this doesn't occur on it.
It seems to run as nice condor user.







Log files would be helpful here. Shut the schedd down, and start it
back up, and send or post the log file somewhere. It's important that
the logfile include the time when the schedd starts.


OK, What exactly causes the schedd to "get stuck" on wrong UID ?

Since I have tried to launch schedd manually, and it began as the correct UID, and then switched to wrong one by itself, I understand that this is schedd's "fault".





-Erik
_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at: https://lists.cs.wisc.edu/archive/condor-users/


--
Maxim Kovgan,
Distributed Systems and Data Mining Laboratory
Computer Science, Technion
http://dsl.cs.technion.ac.il