
[HTCondor-users] Shadow Exception, why?



The pool is all Windows 7 x64:
$CondorVersion: 8.0.4 Oct 19 2013 BuildID: 189770 $
$CondorPlatform: x86_64_Windows7 $


I've been getting lots of Shadow Exceptions, here's a typical one (job log file):

000 (117.019.000) 12/16 18:02:12 Job submitted from host: <x.y.z.189:9728>
...
007 (117.019.000) 12/16 18:08:08 Shadow exception!
    Error from slot4@xxxxxxxxxxxxxxxxx: Failed to transfer files
    0  -  Run Bytes Sent By Job
    13252  -  Run Bytes Received By Job
...

The ShadowLog on the submit machine (.189) (bdomo-002):

12/16/13 18:18:22 (117.1) (6616): Job 117.1 is being evicted from slot2@xxxxxxxxxxxxxxxxxxx
12/16/13 18:18:22 (117.1) (6616): **** condor_shadow (condor_SHADOW) pid 6616 EXITING WITH STATUS 102
12/16/13 18:19:38 (117.5) (8068): Job 117.5 is being evicted from slot2@xxxxxxxxxxxxxxxxxxx
12/16/13 18:19:38 (117.5) (8068): **** condor_shadow (condor_SHADOW) pid 8068 EXITING WITH STATUS 102
12/16/13 18:19:40 (117.11) (7936): Job 117.11 is being evicted from slot4@xxxxxxxxxxxxxxxxxxx
12/16/13 18:19:40 (117.11) (7936): **** condor_shadow (condor_SHADOW) pid 7936 EXITING WITH STATUS 102
12/16/13 18:23:01 (117.2) (6880): Job 117.2 is being evicted from slot3@xxxxxxxxxxxxxxxxxxx
12/16/13 18:23:01 (117.2) (6880): **** condor_shadow (condor_SHADOW) pid 6880 EXITING WITH STATUS 102
12/16/13 18:23:12 (117.3) (6196): Job 117.3 is being evicted from slot4@xxxxxxxxxxxxxxxxxxx
12/16/13 18:23:12 (117.3) (6196): **** condor_shadow (condor_SHADOW) pid 6196 EXITING WITH STATUS 102
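To see how widespread these exits are, the ShadowLog can be tallied by exit status. The following is only a rough sketch, assuming the line format shown in the excerpt above; the function names `shadow_exits` and `count_by_status` are made up for illustration and are not part of HTCondor.

```python
import re
from collections import Counter

# Matches shadow exit lines in the format seen in the excerpt, e.g.
# 12/16/13 18:18:22 (117.1) (6616): **** condor_shadow (condor_SHADOW) pid 6616 EXITING WITH STATUS 102
# (Assumption: other HTCondor versions may format these lines differently.)
EXIT_RE = re.compile(
    r"^(?P<date>\d\d/\d\d/\d\d) (?P<time>\d\d:\d\d:\d\d) "
    r"\((?P<job>\d+\.\d+)\) \((?P<pid>\d+)\): "
    r".*EXITING WITH STATUS (?P<status>\d+)"
)


def shadow_exits(lines):
    """Return a list of (job_id, exit_status) for each shadow exit line."""
    exits = []
    for line in lines:
        match = EXIT_RE.match(line.strip())
        if match:
            exits.append((match.group("job"), int(match.group("status"))))
    return exits


def count_by_status(lines):
    """Count how many shadows exited with each status code."""
    return Counter(status for _job, status in shadow_exits(lines))


# Two lines copied from the excerpt above, used as sample input.
sample = [
    "12/16/13 18:18:22 (117.1) (6616): Job 117.1 is being evicted from slot2@xxxxxxxxxxxxxxxxxxx",
    "12/16/13 18:18:22 (117.1) (6616): **** condor_shadow (condor_SHADOW) pid 6616 EXITING WITH STATUS 102",
]
print(count_by_status(sample))
```

To run it against the real log, pass an open file object for the ShadowLog instead of `sample`; the location of that file depends on the local configuration.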

We have a typical switch for our LAN, nominally 1 Gb/s. Each job transfers a couple of dozen files totaling at most 200 MB, and 20 jobs are submitted to the queue at one time.
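As a back-of-the-envelope check on that load, here is a minimal sketch of the best-case transfer time. It assumes decimal units (1 MB = 8 Mb, 1 Gb = 1000 Mb), that all 20 jobs pull their full 200 MB through a single 1 Gb/s link at once, and that there is no protocol or disk overhead; the function name `min_transfer_seconds` is made up for illustration.

```python
def min_transfer_seconds(jobs, mb_per_job, link_gbps):
    """Best-case time to move all input files over one shared link.

    Assumes decimal units and ignores protocol, disk, and scheduling overhead,
    so the real time will be longer.
    """
    total_megabits = jobs * mb_per_job * 8       # megabytes -> megabits
    return total_megabits / (link_gbps * 1000)   # Gb/s -> Mb/s


# 20 jobs x 200 MB each over a nominal 1 Gb/s link.
print(min_transfer_seconds(20, 200, 1.0))  # 32.0
```

So even in the ideal case the whole batch needs about half a minute of link time, and in practice more.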

Should this really cause a problem? Is there a way to find out whether a failure to transfer files REALLY is the problem? I suspect it is not. Even though Condor starts new execute jobs, the master program (run interactively from a command prompt window) usually doesn't see them. So I submit another 20, kill the old set, and everything is good: no shadow exceptions, and the master program finds its condorized slaves. Maybe the shadows on my submit machine are giving up too quickly because of some delay?

Ralph Finch
Calif. Dept. of Water Resources
Sacramento, Calif. USA