
Re: [Condor-users] "Shadow exception!" error. What happened?

On Tue, 18 May 2010 Todd Tannenbaum wrote:
> Rob wrote:
>     Hi,
    I have a Fedora Linux machine running Condor 7.4.2, acting as master for
    a pool of Windows XP systems running Condor 7.2.4. I submit a VMware 1.0
    virtual machine. This usually works all right, but occasionally the job
    gets stuck with this "Shadow exception".
    Can somebody tell me where this comes from?
    See below for more details.
> The ShadowLog file would hopefully contain the best clues, especially the
> section of the ShadowLog for the shadow instance that put the job on hold.
> The ShadowLog will contain the job id and the shadow pid number on each
> line - from the schedd log below we see that shadow pid 30312 for job 37.0
> was the one that propagated the error, so that is the section of the ShadowLog
> to focus on.

There is only this:

05/18 23:35:24 Initializing a VM shadow for job 37.0
05/18 23:35:28 (37.0) (30312): Request to run on slot1@32-6 <xxx.xxx.xxx.xxx:2737> was ACCEPTED
05/18 23:36:07 (37.0) (30312): Job 37.0 going into Hold state (code 6,0): Error from slot1@32-6: (null)
05/18 23:36:10 (37.0) (30312): **** condor_shadow (condor_SHADOW) pid 30312 EXITING WITH STATUS 112
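
For a larger ShadowLog, a minimal Python sketch along these lines could pull
out the section for one shadow instance. It assumes every relevant line has
the shape seen in the excerpt above (timestamp, job id in parentheses, shadow
pid in parentheses, a colon, then the message). The function name
shadow_lines and the regular expression are illustrative only and are not
part of Condor.

```python
import re

# Assumed line shape, taken from the excerpt above:
#   MM/DD HH:MM:SS (job_id) (pid): message
# Lines without the "(job) (pid):" prefix (such as the "Initializing" line)
# do not match and are skipped.
LINE_RE = re.compile(
    r"^(\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \((\d+\.\d+)\) \((\d+)\): (.*)$"
)


def shadow_lines(lines, job_id, pid):
    """Return (timestamp, message) pairs logged by one shadow pid for one job."""
    matches = []
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if m and m.group(2) == job_id and m.group(3) == str(pid):
            matches.append((m.group(1), m.group(4)))
    return matches
```

Passing an open file object for the ShadowLog together with "37.0" and 30312
should return only the lines written by that shadow, in log order.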

The job was kept on hold overnight, although this particular machine was
powered off before midnight.

Also, the next morning plenty of other machines were in Unclaimed state,
but condor continued to keep the job on hold:

$ condor_q -better-analyze

-- Submitter: master.host.name : <xxx.xxx.xxx.xxx:54074> : master.host.name
037.000:  Request is held.

Hold reason: Error from slot1@32-6: (null)

Why does condor not recognize that the pool PC has vanished from the pool
and schedule the job on another pool PC?

The Linux Master PC with the scheduler, collector, and negotiator daemons
is running condor version 7.4.2, whereas the pool PCs have version 7.2.4.