Re: [Condor-users] Shadow exceptions on Window Machines



On Wed, Mar 30, 2005 at 02:01:33PM -0600, Dodge, Richard wrote:
> 
> The following seem to occur when shadow exceptions are encountered, but
> not on jobs that complete properly:
> 
> 3/24 03:13:43 condor_write(): send() returned -1, timeout=0,
> errno=10054.  Assuming failure.
> 3/24 03:13:43 Buf::write(): condor_write() failed
> 3/24 03:13:43 ERROR "Assertion ERROR on (filetrans->UploadFiles(true,
> final_transfer))" at line 336 in file
> ..\src\condor_starter.V6.1\jic_shadow.C
> 
> Any ideas ???
> 

Any firewalls involved? Many firewalls drop the state of TCP connections that 
haven't been used in a couple of hours. We've seen cases where long-running 
jobs get killed after a couple of hours, but short-running jobs are fine.
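
For what it's worth, errno 10054 is the Winsock "connection reset" code 
(WSAECONNRESET), which fits an idle connection being forgotten somewhere 
along the path. As a generic illustration of how an application can keep an 
idle TCP connection alive, here is a rough Python sketch. This is not how 
Condor manages its own sockets; the function name enable_keepalive and the 
timer values are made up for the example, and the TCP_KEEP* options are only 
set where the platform exposes them.

```python
import socket

def enable_keepalive(sock, idle_s=300, interval_s=60, probes=5):
    """Turn on TCP keepalive so an idle connection still sends traffic.

    The timer values are illustrative assumptions; pick an idle time
    shorter than the firewall's idle timeout.
    """
    # Portable switch: ask the OS to send keepalive probes on this socket.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Finer-grained timers are platform-specific, so set them only if the
    # constant exists and the OS accepts it.
    for name, value in (("TCP_KEEPIDLE", idle_s),
                        ("TCP_KEEPINTVL", interval_s),
                        ("TCP_KEEPCNT", probes)):
        opt = getattr(socket, name, None)
        if opt is None:
            continue
        try:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
        except OSError:
            pass  # option defined but not supported here; keep defaults
    return sock
```

If the firewall timeout is configurable, raising it may be simpler than 
changing anything on the hosts.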

-Erik

> 
> 
> Richard Dodge
> Kimberly-Clark Corporation
> 2100 Winchester Rd.
> Neenah, WI 54956
> (920) 721-5134
> Fax: (920) 721-7748
> rdodge@xxxxxxx
> 
> 
> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx
> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Alain Roy
> Sent: Wednesday, March 30, 2005 1:29 PM
> To: Condor-Users Mail List
> Subject: Re: [Condor-users] Shadow exceptions on Window Machines
> 
> 
> 
> >What are shadow exceptions and what can I do to avoid them?
> 
> The condor_shadow is a program that watches over a job. There is one
> shadow per job, and it runs on the submission computer. When there is an
> exception, there has been some sort of problem that prevents the shadow
> from continuing. This could be anything from a permissions problem to a
> programming error on our part.
> 
> The condor_starter is a program that watches over a job, but it runs on
> the execution machine. It can also have an exception that causes your job
> to fail.
> 
> >007 (3387.000.000) 03/24 03:13:43 Shadow exception!
> >         Can no longer talk to condor_starter on execute machine
> >(172.16.204.38)
> 
> Do two things:
> 
> 1) Look in the ShadowLog for messages from around 3:13 and see what error
> messages you have.
> 
> 2) On the execution computer (172.16.204.38), look in the StarterLog for
> messages around 3:13 and see what error messages you have.
> 
> One of these log files is likely to point the finger at the problem. If
> it doesn't, we can increase the amount of debugging output in the log
> files and try again.
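
A quick way to pull out the lines around a given time from either log is 
sketched below in Python. It assumes the "M/D HH:MM:SS message" line format 
seen in the excerpts in this thread; the function name lines_near, the 
five-minute window, and the year default are made up for the example (the 
log lines carry no year, so one has to be supplied). Lines without a 
timestamp are treated as continuations of the previous line.

```python
import re
from datetime import datetime, timedelta

# Matches a leading timestamp such as "3/24 03:13:43 ".
STAMP = re.compile(r"^(\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2}) ")
FORMAT = "%Y %m/%d %H:%M:%S"

def lines_near(lines, target, window_minutes=5, year=2005):
    """Return log lines whose timestamp is within the window around target.

    target uses the same "M/D HH:MM:SS" form as the log lines.
    """
    center = datetime.strptime(f"{year} {target}", FORMAT)
    window = timedelta(minutes=window_minutes)
    keep = False
    selected = []
    for line in lines:
        match = STAMP.match(line)
        if match:
            when = datetime.strptime(f"{year} {match.group(1)}", FORMAT)
            keep = abs(when - center) <= window
        # Untimestamped lines inherit the decision made for the last stamp.
        if keep:
            selected.append(line.rstrip("\n"))
    return selected
```

To use it, read the ShadowLog or StarterLog into a list of lines (for 
example with open(path).readlines(), where path is wherever your logs live) 
and pass that list along with "3/24 03:13:43".
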
> 
> You might ask--why do you have to go digging through log files in order
> to find the problem? In some cases, we should have implemented a better
> method of propagating errors to you via the user log file. In other cases,
> it's really hard to figure out how to propagate the error messages because
> of the nature of the problem. As we are able to improve the error
> reporting, we do. Given the wide variety of problems that occur, this is a
> hard job.
> 
> I hope this helps to understand the problem.
> 
> -alain
> 
> 
> _______________________________________________
> Condor-users mailing list
> Condor-users@xxxxxxxxxxx
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
> 
> 