[Condor-users] "Shadow exception!" error. What happened?



Hi,

I have a Fedora Linux machine running Condor (7.4.2), managing a pool of Windows XP
systems running Condor (7.2.4). I submit a VMware 1.0 virtual machine job.
This usually works all right, but occasionally the job gets stuck with a
"Shadow exception".

Can somebody tell me what causes this?
See below for more details.

Thanks,
Rob.



In the job's log file I get this:

000 (037.000.000) 05/18 23:35:23 Job submitted from host: <xxx.xxx.xxx.xxx:54074>
...
001 (037.000.000) 05/18 23:35:45 Job executing on host: <xxx.xxx.xxx.xxx:2737>
...
007 (037.000.000) 05/18 23:36:07 Shadow exception!
    Error from slot1@32-6: (null)
    0  -  Run Bytes Sent By Job
    11207558  -  Run Bytes Received By Job
...
012 (037.000.000) 05/18 23:36:10 Job was held.
    Error from slot1@32-6: (null)
    Code 6 Subcode 0
...

The SchedLog contains the following:

05/18 23:35:24 (pid:26741) Tables are consistent
05/18 23:35:24 (pid:26741) Rebuilt prioritized runnable job list in 0.000s.
05/18 23:35:24 (pid:26741) Out of jobs - 1 jobs matched, 0 jobs idle, flock level = 0
05/18 23:35:24 (pid:26741) Completed REQUEST_CLAIM to startd slot1@32-6 <xxx.xxx.xxx.xxx:2737> for myname@xxxxxxxxxxxxxx
05/18 23:35:24 (pid:26741) Starting add_shadow_birthdate(37.0)
05/18 23:35:24 (pid:26741) Started shadow for job 37.0 on slot1@32-6 <xxx.xxx.xxx.xxx:2737> for myname@xxxxxxxxxxxxxx, (shadow pid = 30312)
05/18 23:35:24 (pid:26741) Activity on stashed negotiator socket
05/18 23:35:24 (pid:26741) Negotiating for owner: myname@xxxxxxxxxxxxxx
05/18 23:35:24 (pid:26741) Out of jobs - 0 jobs matched, 0 jobs idle, flock level = 0
05/18 23:35:29 (pid:26741) Sent ad to central manager for myname@xxxxxxxxxxxxxx
05/18 23:35:29 (pid:26741) Sent ad to 1 collectors for myname@xxxxxxxxxxxxxx
05/18 23:35:29 (pid:26741) attempt to connect to <xxx.xxx.xxx.xxx:9618> failed: Connection refused (connect errno = 111).
05/18 23:35:29 (pid:26741) ERROR: SECMAN:2004:Failed to create security session to <xxx.xxx.xxx.xxx:9618> with TCP.|SECMAN:2003:TCP connection to <xxx.xxx.xxx.xxx:9618> failed.
05/18 23:35:29 (pid:26741) Failed to start non-blocking update to <xxx.xxx.xxx.xxx:9618>.
05/18 23:36:10 (pid:26741) Shadow pid 30312 for job 37.0 exited with status 112
05/18 23:36:10 (pid:26741) Putting job 37.0 on hold
05/18 23:36:11 (pid:26741) Checking consistency running and runnable jobs
05/18 23:36:11 (pid:26741) Tables are consistent
05/18 23:36:11 (pid:26741) Rebuilt prioritized runnable job list in 0.000s.  (Expedited rebuild because no match was found)
05/18 23:36:11 (pid:26741) match (slot1@32-6 <xxx.xxx.xxx.xxx:2737> for myname@xxxxxxxxxxxxxx) out of jobs; relinquishing
05/18 23:36:11 (pid:26741) Completed RELEASE_CLAIM to startd at <xxx.xxx.xxx.xxx:2737>
05/18 23:36:11 (pid:26741) Match record (slot1@32-6 <xxx.xxx.xxx.xxx:2737> for myname@xxxxxxxxxxxxxx, 37.-1) deleted
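
To find the same pattern across a larger SchedLog, a small Python sketch like the following could extract every shadow exit line. It is only an illustration: `shadow_exits` is a hypothetical helper, and the regular expression assumes the exact line format shown in the excerpt above ("Shadow pid N for job C.P exited with status S"). It does not interpret the status value.

```python
import re

# Matches lines such as:
# "05/18 23:36:10 (pid:26741) Shadow pid 30312 for job 37.0 exited with status 112"
SHADOW_EXIT_RE = re.compile(
    r"^(?P<stamp>\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \(pid:\d+\) "
    r"Shadow pid (?P<pid>\d+) for job (?P<job>\d+\.\d+) "
    r"exited with status (?P<status>\d+)"
)


def shadow_exits(schedlog_text):
    """Return (timestamp, job id, shadow pid, exit status) per shadow exit line."""
    results = []
    for line in schedlog_text.splitlines():
        m = SHADOW_EXIT_RE.match(line)
        if m:
            results.append(
                (
                    m.group("stamp"),
                    m.group("job"),
                    int(m.group("pid")),
                    int(m.group("status")),
                )
            )
    return results
```

Comparing the timestamps returned here with the 007 and 012 events in the job log makes it easier to see whether every hold coincides with the same shadow exit status.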



condor_status shows 2 idle Windows XP machines (4 slots):

Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1@32-6         WINNT51    INTEL  Unclaimed Idle     0.000  1010  0+00:09:24
slot2@32-6         WINNT51    INTEL  Unclaimed Idle     0.000  1010  0+00:20:05
slot1@6-4          WINNT51    INTEL  Unclaimed Idle     0.000  1010  0+00:20:04
slot2@6-4          WINNT51    INTEL  Unclaimed Idle     0.000  1010  0+00:20:05
                     Total Owner Claimed Unclaimed Matched Preempting Backfill

       INTEL/WINNT51     4     0       0         4       0          0        0

               Total     4     0       0         4       0          0        0