Re: [Condor-users] schedd tripped over?



Hi again,

It is happening again right now. Let's focus on job
9574196 (line wrapping disabled for clarity):

$ grep 9574196 SchedLog.old SchedLog
SchedLog.old:2/9 12:08:39 (pid:4294) Shadow pid 23192 for job 9574196.0 exited with status 4
SchedLog.old:2/9 12:08:43 (pid:4294) Starting add_shadow_birthdate(9574196.0)
SchedLog.old:2/9 12:08:43 (pid:4294) Started shadow for job 9574196.0 on slot3@xxxxxxxxxxxxxxxxx <10.10.4.38:33481> for A@xxxxxxxxxxx, (shadow pid = 13068)
SchedLog.old:2/9 13:26:03 (pid:4294) Shadow pid 13068 for job 9574196.0 exited with status 4
SchedLog.old:2/9 13:26:03 (pid:4294) Starting add_shadow_birthdate(9574196.0)
SchedLog.old:2/9 13:26:03 (pid:4294) Started shadow for job 9574196.0 on slot3@xxxxxxxxxxxxxxxxx <10.10.4.38:33481> for A@xxxxxxxxxxx, (shadow pid = 10220)
SchedLog.old:2/9 14:49:04 (pid:4294) Shadow pid 10220 for job 9574196.0 exited with status 4
SchedLog.old:2/9 14:49:04 (pid:4294) Starting add_shadow_birthdate(9574196.0)
SchedLog.old:2/9 14:49:04 (pid:4294) Started shadow for job 9574196.0 on slot3@xxxxxxxxxxxxxxxxx <10.10.4.38:33481> for A@xxxxxxxxxxx, (shadow pid = 23402)
SchedLog.old:2/9 17:27:09 (pid:4294) Shadow pid 23402 for job 9574196.0 exited with status 4
SchedLog.old:2/9 17:27:13 (pid:4294) Starting add_shadow_birthdate(9574196.0)
SchedLog.old:2/9 17:27:13 (pid:4294) Started shadow for job 9574196.0 on slot2@xxxxxxxxxxxxxxxxx <10.10.14.81:34304> for A@xxxxxxxxxxx, (shadow pid = 22718)
SchedLog.old:2/9 19:26:33 (pid:4294) OwnerCheck(B) failed in SetAttribute for job 9574196.0
SchedLog.old:2/9 20:06:46 (pid:4294) OwnerCheck(B) failed in SetAttribute for job 9574196.0
SchedLog.old:2/10 00:13:46 (pid:4294) Shadow pid 22718 for job 9574196.0 exited with status 4
SchedLog.old:2/10 00:13:53 (pid:4294) Starting add_shadow_birthdate(9574196.0)
SchedLog.old:2/10 00:13:54 (pid:4294) Started shadow for job 9574196.0 on slot2@xxxxxxxxxxxxxxxxx <10.10.13.82:59901> for A@xxxxxxxxxxx, (shadow pid = 29031)
SchedLog.old:2/10 01:43:49 (pid:4294) Shadow pid 29031 for job 9574196.0 exited with status 4
SchedLog.old:2/10 01:43:50 (pid:4294) Starting add_shadow_birthdate(9574196.0)
SchedLog.old:2/10 01:43:50 (pid:4294) Started shadow for job 9574196.0 on slot2@xxxxxxxxxxxxxxxxx <10.10.13.82:59901> for A@xxxxxxxxxxx, (shadow pid = 31025)
SchedLog.old:2/10 03:32:47 (pid:4294) Shadow pid 31025 for job 9574196.0 exited with status 4
SchedLog.old:2/10 03:32:47 (pid:4294) Match for cluster 9574196 has had 5 shadow exceptions, relinquishing.
SchedLog.old:2/10 03:32:47 (pid:4294) Match record (slot2@xxxxxxxxxxxxxxxxx <10.10.13.82:59901> for A@xxxxxxxxxxx, 9574196.0) deleted
SchedLog.old:2/10 03:42:59 (pid:4294) Starting add_shadow_birthdate(9574196.0)
SchedLog.old:2/10 03:42:59 (pid:4294) Started shadow for job 9574196.0 on slot4@xxxxxxxxxxxxxxxxx <10.10.9.39:54070> for A@xxxxxxxxxxx, (shadow pid = 1698)
SchedLog.old:2/10 05:20:48 (pid:4294) Shadow pid 1698 for job 9574196.0 exited with status 4
SchedLog.old:2/10 05:29:43 (pid:4294) Starting add_shadow_birthdate(9574196.0)
SchedLog.old:2/10 05:29:43 (pid:4294) Started shadow for job 9574196.0 on slot4@xxxxxxxxxxxxxxxxx <10.10.9.39:54070> for A@xxxxxxxxxxx, (shadow pid = 24352)
SchedLog.old:2/10 06:10:28 (pid:4294) Shadow pid 24352 for job 9574196.0 exited with status 107
SchedLog.old:2/10 06:10:28 (pid:4294) Match record (slot4@xxxxxxxxxxxxxxxxx <10.10.9.39:54070> for A@xxxxxxxxxxx, 9574196.0) deleted
SchedLog.old:2/10 08:03:07 (pid:4294) Request was NOT accepted for claim slot1@xxxxxxxxxxxxxxxxx <10.10.16.2:57859> for A@xxxxxxxxxxx 9574196.0
SchedLog.old:2/10 08:03:07 (pid:4294) Match record (slot1@xxxxxxxxxxxxxxxxx <10.10.16.2:57859> for A@xxxxxxxxxxx, 9574196.0) deleted
SchedLog.old:2/10 08:10:29 (pid:4294) Request was NOT accepted for claim slot2@xxxxxxxxxxxxxxxxx <10.10.5.64:45912> for A@xxxxxxxxxxx 9574196.0
SchedLog.old:2/10 08:10:29 (pid:4294) Match record (slot2@xxxxxxxxxxxxxxxxx <10.10.5.64:45912> for A@xxxxxxxxxxx, 9574196.0) deleted
SchedLog.old:2/10 08:14:20 (pid:4294) Request was NOT accepted for claim slot4@xxxxxxxxxxxxxxxxx <10.10.7.18:43499> for A@xxxxxxxxxxx 9574196.0
SchedLog.old:2/10 08:14:20 (pid:4294) Match record (slot4@xxxxxxxxxxxxxxxxx <10.10.7.18:43499> for A@xxxxxxxxxxx, 9574196.0) deleted
SchedLog.old:2/10 08:20:24 (pid:4294) Request was NOT accepted for claim slot1@xxxxxxxxxxxxxxxxx <10.10.6.83:41614> for A@xxxxxxxxxxx 9574196.0
SchedLog.old:2/10 08:20:24 (pid:4294) Match record (slot1@xxxxxxxxxxxxxxxxx <10.10.6.83:41614> for A@xxxxxxxxxxx, 9574196.0) deleted
SchedLog.old:2/10 08:23:56 (pid:4294) Request was NOT accepted for claim slot1@xxxxxxxxxxxxxxxxx <10.10.4.84:32794> for A@xxxxxxxxxxx 9574196.0
SchedLog.old:2/10 08:23:56 (pid:4294) Match record (slot1@xxxxxxxxxxxxxxxxx <10.10.4.84:32794> for A@xxxxxxxxxxx, 9574196.0) deleted
SchedLog.old:2/10 08:28:38 (pid:4294) Request was NOT accepted for claim slot2@xxxxxxxxxxxxxxxxx <10.10.4.74:52002> for A@xxxxxxxxxxx 9574196.0
SchedLog.old:2/10 08:28:38 (pid:4294) Match record (slot2@xxxxxxxxxxxxxxxxx <10.10.4.74:52002> for A@xxxxxxxxxxx, 9574196.0) deleted
SchedLog.old:2/10 08:34:06 (pid:4294) Request was NOT accepted for claim slot3@xxxxxxxxxxxxxxxxx <10.10.9.67:52101> for A@xxxxxxxxxxx 9574196.0
SchedLog.old:2/10 08:34:06 (pid:4294) Match record (slot3@xxxxxxxxxxxxxxxxx <10.10.9.67:52101> for A@xxxxxxxxxxx, 9574196.0) deleted
SchedLog.old:2/10 09:28:59 (pid:9462) Starting add_shadow_birthdate(9574196.0)
SchedLog.old:2/10 09:28:59 (pid:9462) Started shadow for job 9574196.0 on slot4@xxxxxxxxxxxxxxxxx <10.10.6.17:39940> for A@xxxxxxxxxxx, (shadow pid = 23579)
SchedLog.old:2/10 11:17:10 (pid:9462) Shadow pid 23579 for job 9574196.0 exited with status 4
SchedLog.old:2/10 11:17:10 (pid:9462) Starting add_shadow_birthdate(9574196.0)
SchedLog.old:2/10 11:17:10 (pid:9462) Started shadow for job 9574196.0 on slot4@xxxxxxxxxxxxxxxxx <10.10.6.17:39940> for A@xxxxxxxxxxx, (shadow pid = 6834)
SchedLog.old:2/10 18:15:19 (pid:9462) OwnerCheck(C) failed in SetAttribute for job 9574196.0
SchedLog.old:2/10 18:48:18 (pid:9462) OwnerCheck(C) failed in SetAttribute for job 9574196.0
SchedLog:2/11 13:13:03 (pid:9462) Shadow pid 6834 for job 9574196.0 exited with status 4
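
To make the pattern in the grep output above easier to see, here is a minimal sketch that tallies the shadow exit statuses for one job. The function name count_exit_statuses and the regular expression are my own assumptions, based only on the "Shadow pid ... exited with status N" lines shown here; this is not a Condor tool, and the sample lines are copied from the log above with the file-name prefix stripped.

```python
import re
from collections import Counter

# Matches lines such as:
#   "... Shadow pid 1698 for job 9574196.0 exited with status 4"
# The pattern is an assumption derived from the SchedLog excerpt above.
EXIT_RE = re.compile(
    r"Shadow pid (\d+) for job (\d+\.\d+) exited with status (\d+)"
)


def count_exit_statuses(lines, job_id):
    """Return a Counter mapping exit status -> number of shadow exits for job_id."""
    counts = Counter()
    for line in lines:
        match = EXIT_RE.search(line)
        if match and match.group(2) == job_id:
            counts[int(match.group(3))] += 1
    return counts


# Three lines taken from the SchedLog.old excerpt above.
sample = [
    "2/10 05:20:48 (pid:4294) Shadow pid 1698 for job 9574196.0 exited with status 4",
    "2/10 05:29:43 (pid:4294) Starting add_shadow_birthdate(9574196.0)",
    "2/10 06:10:28 (pid:4294) Shadow pid 24352 for job 9574196.0 exited with status 107",
]

for status, count in sorted(count_exit_statuses(sample, "9574196.0").items()):
    print(f"status {status}: {count} exit(s)")
```

Feeding it the lines of SchedLog.old and SchedLog instead of the sample list should give the per-status totals for the whole history of the job.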

The ShadowLog tells this tale:
$ grep -A5 -B5 9574196 ShadowLog.old ShadowLog
ShadowLog-2/11 13:13:02 (9547479.0) (5738):Trying to unlink /local/condor.h2/spool/cluster9547479.proc0.subproc0.tmp
ShadowLog-2/11 13:13:02 (9573199.0) (3150):ERROR "Unable to talk to job: disconnected
ShadowLog-" at line 154 in file receivers.cpp
ShadowLog-2/11 13:13:02 (9573199.0) (3150):Shadow: DoCleanup: unlinking TmpCkpt '/local/condor.h2/spool/cluster9573199.proc0.subproc0.tmp'
ShadowLog-2/11 13:13:02 (9573199.0) (3150):Trying to unlink /local/condor.h2/spool/cluster9573199.proc0.subproc0.tmp
ShadowLog:2/11 13:13:02 (9574196.0) (6834):ERROR "Unable to talk to job: disconnected
ShadowLog-" at line 154 in file receivers.cpp
ShadowLog:2/11 13:13:02 (9574196.0) (6834):Shadow: DoCleanup: unlinking TmpCkpt '/local/condor.h2/spool/cluster9574196.proc0.subproc0.tmp'
ShadowLog:2/11 13:13:02 (9574196.0) (6834):Trying to unlink /local/condor.h2/spool/cluster9574196.proc0.subproc0.tmp
ShadowLog-2/11 13:13:59 (?.?) (7817):******* Standard Shadow starting up *******
ShadowLog-2/11 13:13:59 (?.?) (7817):** $CondorVersion: 7.2.0 Dec 25 2008 BuildID: X86_64-LINUX_DEBIAN40_ATLAS $
ShadowLog-2/11 13:13:59 (?.?) (7817):** $CondorPlatform: X86_64-LINUX_DEBIAN40 $
ShadowLog-2/11 13:13:59 (?.?) (7817):*******************************************
ShadowLog-2/11 13:13:59 (?.?) (7817):uid=0, euid=0, gid=0, egid=0

Any ideas? I will probably stop and start Condor again to get the jobs running...

Cheers

Carsten