
[Condor-users] Shadow exceptions on Windows Machines



What are shadow exceptions and what can I do to avoid them?

When I submit jobs to Windows XP machines, I repeatedly get "shadow
exceptions" that cause the job to restart, so it never produces any
results.  This seems to happen to jobs that run for at least 8 hours;
shorter jobs on the same machines run fine.  On Linux machines, jobs
that run for more than 48 hours never hit "shadow exceptions".

Here is a typical job.log from a job with the "shadow exception" problem:


000 (3387.000.000) 03/23 15:37:59 Job submitted from host:
<172.16.230.242:32775>
...
001 (3387.000.000) 03/23 17:20:35 Job executing on host:
<172.16.204.38:1047>
...
006 (3387.000.000) 03/23 17:20:43 Image size of job updated: 65836
...
006 (3387.000.000) 03/23 17:40:46 Image size of job updated: 252316
...
007 (3387.000.000) 03/24 03:13:43 Shadow exception!
	Can no longer talk to condor_starter on execute machine
(172.16.204.38)
	0  -  Run Bytes Sent By Job
	6627422  -  Run Bytes Received By Job
...
001 (3387.000.000) 03/24 03:13:47 Job executing on host:
<172.16.204.38:1047>
...
006 (3387.000.000) 03/24 03:33:55 Image size of job updated: 252316
...
004 (3387.000.000) 03/24 07:33:32 Job was evicted.
	(0) Job was not checkpointed.
		Usr 0 04:17:51, Sys 0 00:00:00  -  Run Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
	0  -  Run Bytes Sent By Job
	6627422  -  Run Bytes Received By Job
...
009 (3387.000.000) 03/24 07:33:32 Job was aborted by the user.
	via condor_rm (by user psmd)
...
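
For anyone who wants to check a job.log for the same pattern, here is a
minimal Python sketch that splits a log like the one above into events and
picks out the 007 (shadow exception) entries.  It assumes only the layout
visible in this excerpt: a header line with a three-digit event code, the
job id in parentheses, a date and a time, followed by detail lines and a
"..." terminator.  The names parse_events, shadow_exceptions and SAMPLE are
made up for this illustration; SAMPLE is a shortened copy of the log above.

```python
import re
from typing import List, NamedTuple

# Header line of an event, for example:
#   007 (3387.000.000) 03/24 03:13:43 Shadow exception!
HEADER = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}) (.*)$"
)


class Event(NamedTuple):
    code: str        # three-digit event code, e.g. "007"
    job: str         # "cluster.proc", e.g. "3387.0"
    date: str        # MM/DD as written in the log
    time: str        # HH:MM:SS as written in the log
    message: str     # rest of the header line
    details: List[str]  # following lines up to the "..." terminator


def parse_events(text: str) -> List[Event]:
    """Split job-log text into events (layout assumed from the excerpt)."""
    events: List[Event] = []
    current = None
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            code, cluster, proc, _sub, date, time, msg = match.groups()
            job = f"{int(cluster)}.{int(proc)}"
            current = Event(code, job, date, time, msg, [])
            events.append(current)
        elif line.strip() == "...":
            current = None  # end of the current event
        elif current is not None and line.strip():
            current.details.append(line.strip())
    return events


def shadow_exceptions(events: List[Event]) -> List[Event]:
    """Keep only the events logged with code 007."""
    return [e for e in events if e.code == "007"]


# Shortened copy of the log in this message.
SAMPLE = """\
006 (3387.000.000) 03/23 17:40:46 Image size of job updated: 252316
...
007 (3387.000.000) 03/24 03:13:43 Shadow exception!
    Can no longer talk to condor_starter on execute machine
(172.16.204.38)
    0  -  Run Bytes Sent By Job
    6627422  -  Run Bytes Received By Job
...
004 (3387.000.000) 03/24 07:33:32 Job was evicted.
    (0) Job was not checkpointed.
...
"""

if __name__ == "__main__":
    for event in shadow_exceptions(parse_events(SAMPLE)):
        print(event.job, event.date, event.time, event.details[0])
```

Run against a full log, this lists when each shadow exception happened, which
makes it easier to compare against how long the job had been running.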

Thanks,

Richard Dodge


-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Colin Gillespie
Sent: Monday, February 14, 2005 4:33 AM
To: Condor-Users Mail List
Subject: [Condor-users] Shadow exceptions


Dear All,

My simple DAG gave off the error code 7. I know that this is a shadow
exception, but what does this mean? The program seemed to finish
correctly.

I also have a POST script that monitors the signals: if an error occurs,
it records the error in a database. In this case, though, the job did
finish correctly. What is the best procedure for error monitoring?
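
One possible shape for such a check, as a minimal Python sketch rather than
a recommended procedure: it assumes the POST script is handed the node's
result as an integer (for example through DAGMan's $RETURN macro, if your
version supports it), and that a job killed by signal N is reported as -N.
The function name classify_return and the result strings are made up for
this illustration; the database write is left out.

```python
import sys


def classify_return(value: int) -> str:
    """Turn a node result into a short status string.

    Assumption: 0 means success, a positive value is the job's exit code,
    and a negative value -N means the job was killed by signal N.
    """
    if value == 0:
        return "ok"
    if value < 0:
        return f"killed by signal {-value}"
    return f"exit code {value}"


if __name__ == "__main__":
    # Hypothetical invocation from a DAG file line such as:
    #   SCRIPT POST condor_script2381 post.py $RETURN
    status = classify_return(int(sys.argv[1])) if len(sys.argv) > 1 else "unknown"
    print(status)
    # Exit non-zero so the node is marked failed unless the job succeeded.
    sys.exit(0 if status == "ok" else 1)
```

With a check like this, a result of -7 would be recorded as "killed by
signal 7" instead of being treated the same as a clean exit.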

Thanks

Colin

condor_script2381Dag.dag.dagman.out 
<snip>
2/12 10:47:42 Event: ULOG_JOB_TERMINATED for Condor Job
condor_script2381 (5441.0)
2/12 10:47:42 Job condor_script2381 failed with signal 7.
2/12 10:47:42 Running POST script of Job condor_script2381...
2/12 10:47:42 Of 1 nodes total:
2/12 10:47:42  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
2/12 10:47:42   ===     ===      ===     ===     ===        ===      ===
2/12 10:47:42     0       0        0       1       0          0        0
2/12 10:47:47 Event: ULOG_POST_SCRIPT_TERMINATED for Condor Job
condor_script2381 (5441.0)
2/12 10:47:47 POST Script of Job condor_script2381 completed
successfully.
2/12 10:47:47 Of 1 nodes total:
2/12 10:47:47  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
2/12 10:47:47   ===     ===      ===     ===     ===        ===      ===
2/12 10:47:47     1       0        0       0       0          0        0
2/12 10:47:47 All jobs Completed!
2/12 10:47:47 **** condor_scheduniv_exec.5440.0 (condor_DAGMAN) EXITING
WITH STATUS 0



