[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[HTCondor-users] [Stanislav.Babak@xxxxxxxxxx: [Vulcan-users] Not again....]



I received a complaint from a user:

> ------------------
> 007 (705.000.000) 09/24 15:52:13 Shadow exception!
>         Can no longer talk to condor_starter <10.150.91.3:43647>
>         0  -  Run Bytes Sent By Job
>         147550  -  Run Bytes Received By Job
> ------------------
> And my jobs are idle.... :(

Checking the logs of the corresponding node (as I don't know which slot
was running the job, I have to query all logs):

/var/log/condor# grep 24_15:52 *
StartdLog:14-09-24_15:52:09 Calling Handler <Alive to schedd <10.150.100.40:59305>> (4)
StartdLog:14-09-24_15:52:09 Return from Handler <Alive to schedd <10.150.100.40:59305>> 0.0001s
StartdLog:14-09-24_15:52:09 Calling Handler <Alive to schedd <10.150.100.40:59305>> (4)
StartdLog:14-09-24_15:52:09 Return from Handler <Alive to schedd <10.150.100.40:59305>> 0.0000s
StartdLog:14-09-24_15:52:20 slot2: State change: claim lease expired (condor_schedd gone?)
StartdLog:14-09-24_15:52:20 slot2: Changing state and activity: Claimed/Busy -> Preempting/Killing
StartdLog:14-09-24_15:52:20 Calling Handler <receiveJobClassAdUpdate> (3)
StartdLog:14-09-24_15:52:20 Return from Handler <receiveJobClassAdUpdate> 0.0001s
StartdLog:14-09-24_15:52:20 Calling Handler <receiveJobClassAdUpdate> (3)
StartdLog:14-09-24_15:52:20 Return from Handler <receiveJobClassAdUpdate> 0.0000s
StartdLog:14-09-24_15:52:20 DaemonCore: pid 2084 exited with status 0, invoking reaper 5 <reaper>
StartdLog:14-09-24_15:52:20 Starter pid 2084 exited with status 0
StartdLog:14-09-24_15:52:20 slot2: State change: starter exited
StartdLog:14-09-24_15:52:20 slot2: State change: No preempting claim, returning to owner
StartdLog:14-09-24_15:52:20 slot2: Changing state and activity: Preempting/Killing -> Owner/Idle
StartdLog:14-09-24_15:52:20 slot2: State change: IS_OWNER is false
StartdLog:14-09-24_15:52:20 slot2: Changing state: Owner -> Unclaimed
StartdLog:14-09-24_15:52:20 DaemonCore: return from reaper for pid 2084
StartdLog:14-09-24_15:52:20 Calling Handler <DaemonCommandProtocol::WaitForSocketData> (3)
StartdLog:14-09-24_15:52:20 Received TCP command 404 (DEACTIVATE_CLAIM_FORCIBLY) from unauthenticated@unmapped <10.150.100.40:46464>, access level DAEMON
StartdLog:14-09-24_15:52:20 Calling HandleReq <command_handler> (0) for command 404 (DEACTIVATE_CLAIM_FORCIBLY) from unauthenticated@unmapped <10.150.100.40:46464>
StartdLog:14-09-24_15:52:21 slot1: Called deactivate_claim_forcibly()
StartdLog:14-09-24_15:52:21 slot1: Changing state and activity: Claimed/Busy -> Preempting/Vacating
StartdLog:14-09-24_15:52:21 Return from HandleReq <command_handler> (handler: 0.204s, sec: 0.000s, payload: 0.000s)
StartdLog:14-09-24_15:52:21 Return from Handler <DaemonCommandProtocol::WaitForSocketData> 0.2043s
StartdLog:14-09-24_15:52:21 Received TCP command 404 (DEACTIVATE_CLAIM_FORCIBLY) from unauthenticated@unmapped <10.150.100.40:57020>, access level DAEMON
StartdLog:14-09-24_15:52:21 Calling HandleReq <command_handler> (0) for command 404 (DEACTIVATE_CLAIM_FORCIBLY) from unauthenticated@unmapped <10.150.100.40:57020>
StartdLog:14-09-24_15:52:21 Error: can't find resource with ClaimId (<10.150.91.3:43647>#1410524749#46#...) -- perhaps this claim was already removed?
StartdLog:14-09-24_15:52:21 Error: problem finding resource for 404 (DEACTIVATE_CLAIM_FORCIBLY)
StartdLog:14-09-24_15:52:21 Return from HandleReq <command_handler> (handler: 0.000s, sec: 0.000s, payload: 0.000s)
StartdLog:14-09-24_15:52:21 Calling Handler <DaemonCommandProtocol::WaitForSocketData> (3)
StartdLog:14-09-24_15:52:21 Received TCP command 404 (DEACTIVATE_CLAIM_FORCIBLY) from unauthenticated@unmapped <10.150.100.40:36617>, access level DAEMON
StartdLog:14-09-24_15:52:21 Calling HandleReq <command_handler> (0) for command 404 (DEACTIVATE_CLAIM_FORCIBLY) from unauthenticated@unmapped <10.150.100.40:36617>
StartdLog:14-09-24_15:52:21 slot3: Called deactivate_claim_forcibly()
StartdLog:14-09-24_15:52:21 slot3: Changing state and activity: Claimed/Busy -> Preempting/Vacating
StartdLog:14-09-24_15:52:21 Return from HandleReq <command_handler> (handler: 0.000s, sec: 0.000s, payload: 0.000s)
StartdLog:14-09-24_15:52:21 Return from Handler <DaemonCommandProtocol::WaitForSocketData> 0.0002s
StartdLog:14-09-24_15:52:21 Calling Handler <receiveJobClassAdUpdate> (2)
StartdLog:14-09-24_15:52:21 Return from Handler <receiveJobClassAdUpdate> 0.0001s
StartdLog:14-09-24_15:52:21 Calling Handler <receiveJobClassAdUpdate> (2)
StartdLog:14-09-24_15:52:21 Return from Handler <receiveJobClassAdUpdate> 0.0000s
StartdLog:14-09-24_15:52:21 Calling Handler <receiveJobClassAdUpdate> (5)
StartdLog:14-09-24_15:52:21 Return from Handler <receiveJobClassAdUpdate> 0.0001s
StartdLog:14-09-24_15:52:21 Calling Handler <receiveJobClassAdUpdate> (5)
StartdLog:14-09-24_15:52:21 Return from Handler <receiveJobClassAdUpdate> 0.0000s
StartdLog:14-09-24_15:52:21 DaemonCore: pid 2044 exited with status 0, invoking reaper 5 <reaper>
StartdLog:14-09-24_15:52:21 Starter pid 2044 exited with status 0
StartdLog:14-09-24_15:52:21 slot1: State change: starter exited
StartdLog:14-09-24_15:52:21 slot1: State change: No preempting claim, returning to owner
StartdLog:14-09-24_15:52:21 slot1: Changing state and activity: Preempting/Vacating -> Owner/Idle
StartdLog:14-09-24_15:52:21 slot1: State change: IS_OWNER is false
StartdLog:14-09-24_15:52:21 slot1: Changing state: Owner -> Unclaimed
StartdLog:14-09-24_15:52:21 DaemonCore: return from reaper for pid 2044
StartdLog:14-09-24_15:52:21 DaemonCore: pid 2113 exited with status 0, invoking reaper 5 <reaper>
StartdLog:14-09-24_15:52:21 Starter pid 2113 exited with status 0
StartdLog:14-09-24_15:52:21 slot3: State change: starter exited
StartdLog:14-09-24_15:52:21 slot3: State change: No preempting claim, returning to owner
StartdLog:14-09-24_15:52:21 slot3: Changing state and activity: Preempting/Vacating -> Owner/Idle
StartdLog:14-09-24_15:52:21 slot3: State change: IS_OWNER is false
StartdLog:14-09-24_15:52:21 slot3: Changing state: Owner -> Unclaimed
StartdLog:14-09-24_15:52:21 DaemonCore: return from reaper for pid 2113
StartdLog:14-09-24_15:52:21 Calling Handler <DaemonCommandProtocol::WaitForSocketData> (2)
StartdLog:14-09-24_15:52:21 Received TCP command 403 (DEACTIVATE_CLAIM) from unauthenticated@unmapped <10.150.100.40:47246>, access level DAEMON
StartdLog:14-09-24_15:52:21 Calling HandleReq <command_handler> (0) for command 403 (DEACTIVATE_CLAIM) from unauthenticated@unmapped <10.150.100.40:47246>
StartdLog:14-09-24_15:52:21 Error: can't find resource with ClaimId (<10.150.91.3:43647>#1410524749#47#...) -- perhaps this claim was already removed?
StartdLog:14-09-24_15:52:21 Error: problem finding resource for 403 (DEACTIVATE_CLAIM)
StartdLog:14-09-24_15:52:21 Return from HandleReq <command_handler> (handler: 0.000s, sec: 0.000s, payload: 0.000s)
StartdLog:14-09-24_15:52:21 Return from Handler <DaemonCommandProtocol::WaitForSocketData> 0.0005s
StartdLog:14-09-24_15:52:21 Calling Handler <DaemonCommandProtocol::WaitForSocketData> (2)
StartdLog:14-09-24_15:52:21 Received TCP command 403 (DEACTIVATE_CLAIM) from unauthenticated@unmapped <10.150.100.40:57472>, access level DAEMON
StartdLog:14-09-24_15:52:21 Calling HandleReq <command_handler> (0) for command 403 (DEACTIVATE_CLAIM) from unauthenticated@unmapped <10.150.100.40:57472>
StartdLog:14-09-24_15:52:21 Error: can't find resource with ClaimId (<10.150.91.3:43647>#1410524749#46#...) -- perhaps this claim was already removed?
StartdLog:14-09-24_15:52:21 Error: problem finding resource for 403 (DEACTIVATE_CLAIM)
StartdLog:14-09-24_15:52:21 Return from HandleReq <command_handler> (handler: 0.000s, sec: 0.000s, payload: 0.000s)
StartdLog:14-09-24_15:52:21 Return from Handler <DaemonCommandProtocol::WaitForSocketData> 0.0001s
StartdLog:14-09-24_15:52:22 Calling Handler <DaemonCommandProtocol::WaitForSocketData> (2)
StartdLog:14-09-24_15:52:22 Received TCP command 403 (DEACTIVATE_CLAIM) from unauthenticated@unmapped <10.150.100.40:59034>, access level DAEMON
StartdLog:14-09-24_15:52:22 Calling HandleReq <command_handler> (0) for command 403 (DEACTIVATE_CLAIM) from unauthenticated@unmapped <10.150.100.40:59034>
StartdLog:14-09-24_15:52:22 Error: can't find resource with ClaimId (<10.150.91.3:43647>#1410524749#17#...) -- perhaps this claim was already removed?
StartdLog:14-09-24_15:52:22 Error: problem finding resource for 403 (DEACTIVATE_CLAIM)
StartdLog:14-09-24_15:52:22 Return from HandleReq <command_handler> (handler: 0.000s, sec: 0.000s, payload: 0.000s)
StartdLog:14-09-24_15:52:22 Return from Handler <DaemonCommandProtocol::WaitForSocketData> 0.0001s
StarterLog.slot1:14-09-24_15:52:21 Got SIGQUIT.  Performing fast shutdown.
StarterLog.slot1:14-09-24_15:52:21 ShutdownFast all jobs.
StarterLog.slot1:14-09-24_15:52:21 Got SIGTERM. Performing graceful shutdown.
StarterLog.slot1:14-09-24_15:52:21 ShutdownGraceful all jobs.
StarterLog.slot1:14-09-24_15:52:21 Process exited, pid=2173, status=0
StarterLog.slot1:14-09-24_15:52:21 condor_read() failed: recv(fd=7) returned -1, errno = 104 Connection reset by peer, reading 5 bytes from <10.150.100.40:38878>.
StarterLog.slot1:14-09-24_15:52:21 IO: Failed to read packet header
StarterLog.slot1:14-09-24_15:52:21 condor_write(): Socket closed when trying to write 92 bytes to <10.150.100.40:38878>, fd is 7
StarterLog.slot1:14-09-24_15:52:21 Buf::write(): condor_write() failed
StarterLog.slot1:14-09-24_15:52:21 Failed to send job exit status to shadow
StarterLog.slot1:14-09-24_15:52:21 Last process exited, now Starter is exiting
StarterLog.slot1:14-09-24_15:52:21 **** condor_starter (condor_STARTER) pid 2044 EXITING WITH STATUS 0
StarterLog.slot2:14-09-24_15:52:20 Got SIGQUIT.  Performing fast shutdown.
StarterLog.slot2:14-09-24_15:52:20 ShutdownFast all jobs.
StarterLog.slot2:14-09-24_15:52:20 Process exited, pid=2172, status=0
StarterLog.slot2:14-09-24_15:52:20 Last process exited, now Starter is exiting
StarterLog.slot2:14-09-24_15:52:20 **** condor_starter (condor_STARTER) pid 2084 EXITING WITH STATUS 0
StarterLog.slot3:14-09-24_15:52:21 Got SIGQUIT.  Performing fast shutdown.
StarterLog.slot3:14-09-24_15:52:21 ShutdownFast all jobs.
StarterLog.slot3:14-09-24_15:52:21 Got SIGTERM. Performing graceful shutdown.
StarterLog.slot3:14-09-24_15:52:21 ShutdownGraceful all jobs.
StarterLog.slot3:14-09-24_15:52:21 Process exited, pid=2563, status=0
StarterLog.slot3:14-09-24_15:52:21 condor_read() failed: recv(fd=7) returned -1, errno = 104 Connection reset by peer, reading 5 bytes from <10.150.100.40:41936>.
StarterLog.slot3:14-09-24_15:52:21 IO: Failed to read packet header
StarterLog.slot3:14-09-24_15:52:21 condor_write(): Socket closed when trying to write 92 bytes to <10.150.100.40:41936>, fd is 7
StarterLog.slot3:14-09-24_15:52:21 Buf::write(): condor_write() failed
StarterLog.slot3:14-09-24_15:52:21 Failed to send job exit status to shadow
StarterLog.slot3:14-09-24_15:52:21 Last process exited, now Starter is exiting
StarterLog.slot3:14-09-24_15:52:21 **** condor_starter (condor_STARTER) pid 2113 EXITING WITH STATUS 0

Unfortunately, the ScheddLogs on 10.150.100.40 have already been
rotated, so I can't look there :( Where should I look next?
I'm rather worried by claims running out (CLAIM_WORKLIFE is 600,
should I increase MAX_CLAIM_ALIVES_MISSED = 20?) and those
"Connection reset by peer" messages (although apparently unrelated)

Thanks,
 S

-- 
Steffen Grunewald * Cluster Admin * steffen.grunewald(*)aei.mpg.de
MPI f. Gravitationsphysik (AEI) * Am Mühlenberg 1, D-14476 Potsdam
http://www.aei.mpg.de/ * ------- * +49-331-567-{fon:7274,fax:7298}