
[Condor-users] Shadow Exception



Hello all

 

We are having some serious problems with our Condor setup and I am at a loss. I hope someone can help me. We started seeing this problem this weekend: jobs are being evicted and restarted. I have one example below, but we have been seeing some other errors as well. They all seem to revolve around losing the connection to the schedd, though.

 

In the job's log I see the following message:

 

007 (727.000.000) 08/29 05:29:59 Shadow exception!

        Assertion ERROR on (result)

        0  -  Run Bytes Sent By Job

        0  -  Run Bytes Received By Job

 

In the ShadowLog I find the following messages:

08/29 05:29:58 (727.22) (27374): condor_write(): Socket closed when trying to write 13 bytes to startd slot12@xxxxxxxxxxxxxxxx, fd is 7

08/29 05:29:58 (727.8) (27376): condor_write(): Socket closed when trying to write 13 bytes to startd slot10@xxxxxxxxxxxxxxxx, fd is 7

08/29 05:29:58 (727.9) (27377): condor_write(): Socket closed when trying to write 13 bytes to startd slot11@xxxxxxxxxxxxxxxx, fd is 7

08/29 05:29:58 (727.3) (27365): condor_write(): Socket closed when trying to write 13 bytes to startd slot5@xxxxxxxxxxxxxxxx, fd is 7

08/29 05:29:58 (727.1) (27359): condor_write(): Socket closed when trying to write 13 bytes to startd slot3@xxxxxxxxxxxxxxxx, fd is 7

08/29 05:29:58 (727.8) (27376): Buf::write(): condor_write() failed

08/29 05:29:58 (727.0) (27355): condor_write(): Socket closed when trying to write 13 bytes to startd slot2@xxxxxxxxxxxxxxxx, fd is 7

08/29 05:29:58 (729.25) (27461): condor_write(): Socket closed when trying to write 13 bytes to startd slot4@xxxxxxxxxxxxxxxx, fd is 7

08/29 05:29:58 (727.8) (27376): ERROR "Assertion ERROR on (result)" at line 238 in file NTreceivers.cpp

08/29 05:29:58 (729.7) (27465): condor_write(): Socket closed when trying to write 13 bytes to startd slot10@xxxxxxxxxxxxxxxx, fd is 7

08/29 05:29:58 (727.3) (27365): Buf::write(): condor_write() failed

08/29 05:29:58 (729.27) (27449): condor_write(): Socket closed when trying to write 13 bytes to startd slot2@xxxxxxxxxxxxxxxx, fd is 7

08/29 05:29:58 (727.3) (27365): ERROR "Assertion ERROR on (result)" at line 238 in file NTreceivers.cpp

08/29 05:29:58 (727.0) (27355): Buf::write(): condor_write() failed

08/29 05:29:58 (729.3) (27456): condor_write(): Socket closed when trying to write 13 bytes to startd slot5@xxxxxxxxxxxxxxxx, fd is 7

08/29 05:29:58 (727.0) (27355): ERROR "Assertion ERROR on (result)" at line 238 in file NTreceivers.cpp

08/29 05:29:58 (727.19) (27369): condor_write(): Socket closed when trying to write 13 bytes to startd slot9@xxxxxxxxxxxxxxxx, fd is 7

08/29 05:29:58 (729.1) (27446): condor_write(): Socket closed when trying to write 13 bytes to startd slot3@xxxxxxxxxxxxxxxx, fd is 7

08/29 05:29:58 (729.23) (27448): condor_write(): Socket closed when trying to write 13 bytes to startd slot2@xxxxxxxxxxxxxxxx, fd is 7

08/29 05:29:59 (729.1) (27446): Buf::write(): condor_write() failed

08/29 05:29:58 (729.18) (27472): condor_write(): Socket closed when trying to write 13 bytes to startd slot9@xxxxxxxxxxxxxxxx, fd is 7

08/29 05:29:58 (727.9) (27377): Buf::write(): condor_write() failed

08/29 05:29:58 (729.25) (27461): Buf::write(): condor_write() failed

08/29 05:29:58 (727.20) (27371): condor_write(): Socket closed when trying to write 13 bytes to startd slot10@xxxxxxxxxxxxxxxx, fd is 7

08/29 05:29:58 (731.0) (28148): condor_write(): Socket closed when trying to write 13 bytes to startd slot8@xxxxxxxxxxxxxxxx, fd is 7
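
With this many interleaved shadows, a quick tally can make the excerpt easier to read. The following is a minimal sketch, not an HTCondor tool: it assumes every ShadowLog line has the layout seen above ("MM/DD HH:MM:SS (cluster.proc) (pid): message"), and the function name `summarize_shadow_log` and the `sample` list are made up for illustration. It counts "Socket closed" messages per slot and collects the jobs whose shadow logged an ERROR line.

```python
import re
from collections import Counter

# Layout assumed from the excerpt: "MM/DD HH:MM:SS (cluster.proc) (pid): message"
LINE_RE = re.compile(r"^(\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \((\d+\.\d+)\) \((\d+)\): (.*)$")
SLOT_RE = re.compile(r"to startd (slot\d+)@")


def summarize_shadow_log(lines):
    """Return (closed-socket count per slot, set of job ids with an ERROR line)."""
    closed_by_slot = Counter()
    failed_jobs = set()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # blank line, or a layout this sketch does not know
        _stamp, job_id, _pid, message = match.groups()
        if "Socket closed" in message:
            slot = SLOT_RE.search(message)
            if slot:
                closed_by_slot[slot.group(1)] += 1
        elif message.startswith("ERROR"):
            failed_jobs.add(job_id)
    return closed_by_slot, failed_jobs


# Four lines copied from the ShadowLog excerpt above.
sample = [
    '08/29 05:29:58 (727.22) (27374): condor_write(): Socket closed when trying to write 13 bytes to startd slot12@xxxxxxxxxxxxxxxx, fd is 7',
    '08/29 05:29:58 (727.0) (27355): condor_write(): Socket closed when trying to write 13 bytes to startd slot2@xxxxxxxxxxxxxxxx, fd is 7',
    '08/29 05:29:58 (729.27) (27449): condor_write(): Socket closed when trying to write 13 bytes to startd slot2@xxxxxxxxxxxxxxxx, fd is 7',
    '08/29 05:29:58 (727.0) (27355): ERROR "Assertion ERROR on (result)" at line 238 in file NTreceivers.cpp',
]

closed_by_slot, failed_jobs = summarize_shadow_log(sample)
print(closed_by_slot.most_common())
print(sorted(failed_jobs))
```

Because the host names are masked here, the sketch groups by slot name only. On the real log, grouping by the full slot@host string would show whether the closed sockets are confined to one execute machine or spread across the pool.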

 

And in SchedLog:

08/29 05:29:59 (pid:2534) Shadow pid 27355 for job 727.0 exited with status 4

08/29 05:29:59 (pid:2534) ERROR: Shadow exited with job exception code!

08/29 05:29:59 (pid:2534) Checking consistency running and runnable jobs

08/29 05:29:59 (pid:2534) Tables are consistent

08/29 05:29:59 (pid:2534) Rebuilt prioritized runnable job list in 0.017s.  (Expedited rebuild because no match was found)

08/29 05:29:59 (pid:2534) Starting add_shadow_birthdate(727.0)

08/29 05:29:59 (pid:2534) Started shadow for job 727.0 on slot2@xxxxxxxxxxxxxxxx <10.69.200.99:56059> for rni@xxxxxxxxxx, (shadow pid = 30888)

 

Here it seems that the shadow for job 727.0 exited with an exception (status 4), which is bad, and the schedd then started a new shadow for the same job.

 

And in StartLog:

08/29 05:29:03 slot2: State change: claim lease expired (condor_schedd gone?)

08/29 05:29:03 slot2: Changing state and activity: Claimed/Busy -> Preempting/Killing

08/29 05:29:33 slot2: starter (pid 6298) is not responding to the request to hardkill its job.  The startd will now directly hard kill the starter and all its decendents.

08/29 05:29:33 Starter pid 6298 died on signal 9 (signal 9 (Killed))

08/29 05:29:33 slot2: State change: starter exited

08/29 05:29:33 slot2: State change: No preempting claim, returning to owner

08/29 05:29:33 slot2: Changing state and activity: Preempting/Killing -> Owner/Idle

08/29 05:29:33 slot2: State change: IS_OWNER is false

08/29 05:29:33 slot2: Changing state: Owner -> Unclaimed

08/29 05:29:33 State change: RunBenchmarks is TRUE

08/29 05:29:33 slot2: Changing activity: Idle -> Benchmarking

08/29 05:29:37 State change: benchmarks completed

08/29 05:29:37 slot2: Changing activity: Benchmarking -> Idle
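
To see the order of events across daemons, the timestamps from the excerpts can be lined up. This is a rough sketch under two assumptions: the clocks on the submit and execute machines agree, and this StartLog comes from the machine that was running job 727.0 (the SchedLog shows that job on slot2). The helper names are hypothetical, and the `year` argument is only a placeholder because the log lines carry no year.

```python
from datetime import datetime


def parse_stamp(stamp, year=2000):
    """Parse 'MM/DD HH:MM:SS'; the logs omit the year, so `year` is a placeholder."""
    return datetime.strptime(f"{year}/{stamp}", "%Y/%m/%d %H:%M:%S")


def seconds_between(earlier, later):
    """Elapsed seconds from one log timestamp to another."""
    return (parse_stamp(later) - parse_stamp(earlier)).total_seconds()


lease_expired = "08/29 05:29:03"   # StartLog: claim lease expired
starter_killed = "08/29 05:29:33"  # StartLog: starter pid 6298 died on signal 9
shadow_failed = "08/29 05:29:58"   # ShadowLog: "Socket closed" lines begin

print(seconds_between(lease_expired, starter_killed))
print(seconds_between(lease_expired, shadow_failed))
```

If those assumptions hold, the startd gave up the claim before the shadows noticed the closed sockets, so the shadow exception looks like a consequence of the lease expiring rather than its cause.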

 

StarterLog.slot2:

08/28 13:24:44 ******************************************************

08/28 13:24:44 ** condor_starter (CONDOR_STARTER) STARTING UP

08/28 13:24:44 ** /opt/condor/sbin/condor_starter

08/28 13:24:44 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)

08/28 13:24:44 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON

08/28 13:24:44 ** $CondorVersion: 7.4.4 Oct 13 2010 BuildID: 279383 $

08/28 13:24:44 ** $CondorPlatform: X86_64-LINUX_RHEL5 $

08/28 13:24:44 ** PID = 6298

08/28 13:24:44 ** Log last touched 8/28 12:40:26

08/28 13:24:44 ******************************************************

08/28 13:24:44 Using config source: /opt/condor/etc/condor_config

08/28 13:24:44 Using local config sources:

08/28 13:24:44    /home/condor/hosts/cmp03/condor_config.local

08/28 13:24:44 DaemonCore: Command Socket at <192.168.0.99:41591>

08/28 13:24:44 Done setting resource limits

08/28 13:24:44 Communicating with shadow <192.168.0.82:38708>

08/28 13:24:44 Submitting machine is "cmp04.hpcalc.net"

08/28 13:24:44 setting the orig job name in starter

08/28 13:24:44 setting the orig job iwd in starter

08/28 13:24:44 Job 727.0 set to execute immediately

08/28 13:24:44 Starting a VANILLA universe job with ID: 727.0

08/28 13:24:44 IWD: /data/proj/P04738_PetrojarlVarg/FLACS/Dispersion/Turret

08/28 13:24:44 Output file: /data/proj/Turret/flacs_011211.out

08/28 13:24:44 Error file: /data/proj/Turret/flacs_011211.err

08/28 13:24:44 About to exec /usr/local/bin/runflacs -j 011211

08/28 13:24:44 Create_Process succeeded, pid=6299

 

Regards, Peter