
Re: [Condor-users] Shadow exceptions on Window Machines




What are shadow exceptions and what can I do to avoid them?

The condor_shadow is a program that watches over a job. There is one shadow per job, and it runs on the submission computer. A shadow exception means the shadow ran into a problem that prevents it from continuing. This could be anything from a permissions problem to a programming error on our part.


The condor_starter also watches over a job, but it runs on the execution machine. It too can have an exception that causes your job to fail.

007 (3387.000.000) 03/24 03:13:43 Shadow exception!
        Can no longer talk to condor_starter on execute machine (172.16.204.38)

Do two things:

1) Look in the ShadowLog for messages from around 3:13 and see what error messages you have.

2) On the execution computer (172.16.204.38), look in the StarterLog for messages around 3:13 and see what error messages you have.
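
If the logs are large, a small script can pull out just the entries near the time of the exception. The following is only a rough sketch in Python, not a Condor tool: the function name lines_near is made up for illustration, and it assumes each ShadowLog or StarterLog entry begins with a "MM/DD HH:MM:SS" timestamp (the log files carry no year, so a placeholder year is used). Adjust the pattern if your log lines look different.

```python
import re
from datetime import datetime, timedelta

# Assumption: daemon log entries start with "MM/DD HH:MM:SS".
STAMP = re.compile(r"^(\d{1,2})/(\d{1,2}) (\d{2}):(\d{2}):(\d{2})")


def lines_near(lines, month, day, hour, minute, window_minutes=5, year=2000):
    """Return log lines stamped within window_minutes of the given time.

    Lines without a timestamp are treated as continuations of the
    previous stamped line and are kept or dropped along with it.
    The year is a placeholder, since the logs do not record one.
    """
    target = datetime(year, month, day, hour, minute)
    window = timedelta(minutes=window_minutes)
    keep = False
    selected = []
    for line in lines:
        match = STAMP.match(line)
        if match:
            mo, d, h, mi, s = map(int, match.groups())
            try:
                stamp = datetime(year, mo, d, h, mi, s)
            except ValueError:
                # Looked like a timestamp but is not a valid date; skip it.
                keep = False
                continue
            keep = abs(stamp - target) <= window
        if keep:
            selected.append(line.rstrip("\n"))
    return selected


# Hypothetical usage; the real log location depends on your configuration:
#   with open("ShadowLog") as log:
#       for entry in lines_near(log, 3, 24, 3, 13):
#           print(entry)
```

Run it once against the ShadowLog on the submission computer and once against the StarterLog on the execution computer, then compare what each side reported around 03:13.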

One of these log files is likely to point the finger at the problem. If it doesn't, we can increase the amount of debugging output in the log files and try again.

You might ask--why do you have to go digging through log files in order to find the problem? In some cases, we should have implemented a better method of propagating errors to you via the user log file. In other cases, it's really hard to figure out how to propagate the error messages because of the nature of the problem. As we are able to improve the error reporting, we do. Given the wide variety of problems that occur, this is a hard job.

I hope this helps you understand the problem.

-alain