[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Condor-users] Help for problem debugging needed



Hi Robert,

Robert Futrick wrote:
> You can try looking at the ShadowLog ( present on the submit/schedd 
> machine the job 4205801.25 was scheduled on). The Shadow is the process 
> that the starter is attempting to communicate with, and might give more 
> information about the issue.

I'm copying all I could gather (omitting the user, ignoring 72+ characters/line):

egotiatorLog:10/16 11:35:26       Rejected 4205801.25 XXX@xxxxxxxxxxx <10.20.30.2:37301>: PREEMPTION_REQUIREMENTS == False
NegotiatorLog:10/16 11:35:54       Rejected 4205801.25 XXX@xxxxxxxxxxx <10.20.30.2:37301>: no match found
NegotiatorLog:10/16 11:35:55       Matched 4205801.25 XXX@xxxxxxxxxxx <10.20.30.2:37301> preempting none <10.10.2.92:50928> slot2@xxxxxxxxxxxxxxxxx
NegotiatorLog:10/18 01:58:54       Rejected 4205801.25 XXX@xxxxxxxxxxx <10.20.30.2:37301>: no match found
NegotiatorLog:10/18 01:58:57       Rejected 4205801.25 XXX@xxxxxxxxxxx <10.20.30.2:37301>: PREEMPTION_REQUIREMENTS == False
NegotiatorLog:10/18 01:59:36       Rejected 4205801.25 XXX@xxxxxxxxxxx <10.20.30.2:37301>: no match found
NegotiatorLog:10/18 01:59:39       Rejected 4205801.25 XXX@xxxxxxxxxxx <10.20.30.2:37301>: PREEMPTION_REQUIREMENTS == False
NegotiatorLog:10/18 02:00:35       Rejected 4205801.25 XXX@xxxxxxxxxxx <10.20.30.2:37301>: no match found
NegotiatorLog:10/18 02:00:38       Rejected 4205801.25 XXX@xxxxxxxxxxx <10.20.30.2:37301>: PREEMPTION_REQUIREMENTS == False
NegotiatorLog:10/18 02:56:21       Matched 4205801.25 XXX@xxxxxxxxxxx <10.20.30.2:37301> preempting none <10.10.7.18:56735> slot1@xxxxxxxxxxxxxxxxx
NegotiatorLog:10/18 06:34:16       Matched 4205801.25 XXX@xxxxxxxxxxx <10.20.30.2:37301> preempting none <10.10.8.37:53203> slot4@xxxxxxxxxxxxxxxxx
SchedLog.old:10/17 11:44:10 (pid:11045) OwnerCheck(YYY) failed in SetAttribute for job 4205801.25
SchedLog.old:10/17 17:05:30 (pid:11045) OwnerCheck(ZZZ) failed in SetAttribute for job 4205801.25
SchedLog.old:10/17 17:05:30 (pid:11045) OwnerCheck(ZZZ) failed in SetAttribute for job 4205801.25
SchedLog.old:10/17 17:05:34 (pid:11045) OwnerCheck(ZZZ) failed in SetAttribute for job 4205801.25
SchedLog.old:10/18 02:25:33 (pid:11045) Starting add_shadow_birthdate(4205801.25)
SchedLog.old:10/18 02:25:33 (pid:11045) Started shadow for job 4205801.25 on "<10.10.12.73:53004>", (shadow pid = 20484)
SchedLog.old:10/18 02:25:34 (pid:11045) Shadow pid 20484 for job 4205801.25 exited with status 108
SchedLog.old:10/18 03:17:54 (pid:11045) Starting add_shadow_birthdate(4205801.25)
SchedLog.old:10/18 03:17:54 (pid:11045) Started shadow for job 4205801.25 on "<10.10.7.18:56735>", (shadow pid = 18449)
SchedLog.old:10/18 06:23:06 (pid:11045) Starting add_shadow_birthdate(4205801.25)
SchedLog.old:10/18 06:23:06 (pid:11045) Started shadow for job 4205801.25 on "<10.10.1.20:38771>", (shadow pid = 11212)
SchedLog.old:10/18 06:33:06 (pid:11045) Shadow pid 11212 for job 4205801.25 exited with status 108
SchedLog.old:10/18 06:34:20 (pid:11045) Starting add_shadow_birthdate(4205801.25)
SchedLog.old:10/18 06:34:20 (pid:11045) Started shadow for job 4205801.25 on "<10.10.8.37:53203>", (shadow pid = 30738)
ShadowLog:10/18 10:02:33 (4205801.25) (30738): ZKM: setting default map to (null)
ShadowLog:10/18 10:17:40 (4205801.25) (30738): ZKM: setting default map to (null)
ShadowLog:10/18 10:32:40 (4205801.25) (30738): ZKM: setting default map to (null)
ShadowLog:10/18 10:47:42 (4205801.25) (30738): ZKM: setting default map to (null)
ShadowLog:10/18 11:02:42 (4205801.25) (30738): ZKM: setting default map to (null)
ShadowLog:10/18 11:17:43 (4205801.25) (30738): ZKM: setting default map to (null)
ShadowLog:10/18 11:32:47 (4205801.25) (30738): ZKM: setting default map to (null)
ShadowLog:10/18 11:47:47 (4205801.25) (30738): ZKM: setting default map to (null)
ShadowLog:10/18 12:02:47 (4205801.25) (30738): ZKM: setting default map to (null)
ShadowLog:10/18 12:17:47 (4205801.25) (30738): ZKM: setting default map to (null)
ShadowLog:10/18 12:32:47 (4205801.25) (30738): ZKM: setting default map to (null)
ShadowLog:10/18 12:47:47 (4205801.25) (30738): ZKM: setting default map to (null)
ShadowLog:10/18 13:02:47 (4205801.25) (30738): ZKM: setting default map to (null)
ShadowLog:10/18 13:17:47 (4205801.25) (30738): ZKM: setting default map to (null)
ShadowLog:10/18 13:32:47 (4205801.25) (30738): ZKM: setting default map to (null)
ShadowLog:10/18 13:47:47 (4205801.25) (30738): ZKM: setting default map to (null)
ShadowLog:10/18 14:02:47 (4205801.25) (30738): ZKM: setting default map to (null)
ShadowLog:10/18 14:17:47 (4205801.25) (30738): ZKM: setting default map to (null)
ShadowLog:10/18 14:32:47 (4205801.25) (30738): ZKM: setting default map to (null)
ShadowLog:10/18 14:47:47 (4205801.25) (30738): ZKM: setting default map to (null)
ShadowLog.old:10/18 06:23:06 Initializing a VANILLA shadow for job 4205801.25
ShadowLog.old:10/18 06:23:06 (4205801.25) (11212): Request to run on <10.10.1.20:38771> was REFUSED
ShadowLog.old:10/18 06:23:06 (4205801.25) (11212): Job 4205801.25 is being evicted
ShadowLog.old:10/18 06:23:06 (4205801.25) (11212): logEvictEvent with unknown reason (108), aborting
ShadowLog.old:10/18 06:24:50 (4205801.25) (11212): ZKM: setting default map to (null)
ShadowLog.old:10/18 06:24:50 (4205801.25) (11212): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 108
ShadowLog.old:10/18 06:34:20 Initializing a VANILLA shadow for job 4205801.25
ShadowLog.old:10/18 06:34:20 (4205801.25) (30738): Request to run on <10.10.8.37:53203> was ACCEPTED
ShadowLog.old:10/18 06:52:30 (4205801.25) (30738): attempt to connect to <10.20.30.2:53100> failed: Connection timed out (connect errno = 110).  Will keep trying for 300 total seconds (111 to go).
ShadowLog.old:10/18 06:56:55 (4205801.25) (30738): condor_read(): recv() returned -1, errno = 104, assuming failure reading 5 bytes from unknown source.
ShadowLog.old:10/18 06:56:55 (4205801.25) (30738): IO: Failed to read packet header
ShadowLog.old:10/18 07:02:06 (4205801.25) (30738): condor_read(): recv() returned -1, errno = 104, assuming failure reading 5 bytes from unknown source.
ShadowLog.old:10/18 07:02:06 (4205801.25) (30738): IO: Failed to read packet header
ShadowLog.old:10/18 07:02:06 (4205801.25) (30738): AUTHENTICATE: handshake failed!
ShadowLog.old:10/18 07:02:06 (4205801.25) (30738): Authentication Error: AUTHENTICATE:1002:Failure performing handshake
ShadowLog.old:10/18 07:17:06 (4205801.25) (30738): ZKM: setting default map to (null)
ShadowLog.old:10/18 07:32:06 (4205801.25) (30738): ZKM: setting default map to (null)
ShadowLog.old:10/18 07:47:13 (4205801.25) (30738): ZKM: setting default map to (null)
ShadowLog.old:10/18 08:02:25 (4205801.25) (30738): ZKM: setting default map to (null)
ShadowLog.old:10/18 08:17:26 (4205801.25) (30738): ZKM: setting default map to (null)
ShadowLog.old:10/18 08:32:26 (4205801.25) (30738): ZKM: setting default map to (null)
ShadowLog.old:10/18 08:47:26 (4205801.25) (30738): ZKM: setting default map to (null)
ShadowLog.old:10/18 09:02:27 (4205801.25) (30738): ZKM: setting default map to (null)
ShadowLog.old:10/18 09:17:32 (4205801.25) (30738): ZKM: setting default map to (null)
ShadowLog.old:10/18 09:32:32 (4205801.25) (30738): ZKM: setting default map to (null)
ShadowLog.old:10/18 09:47:33 (4205801.25) (30738): ZKM: setting default map to (null)

Albeit, this does not help me much, anyone with more experience?

Thanks a lot!

Carsten