
RE: [Condor-users] Shadow exceptions on Window Machines



Hi Richard,
	When I was getting that error (I'm all Windows XP as well), it
always seemed to be because the job was having trouble finding an input
file.  If the condor_starter can't find an input file, it doesn't say
so; it just stops talking to the condor_shadow, and you get that error.
The condor_shadow is a process that runs on the submitting machine to
keep track of a running job it submitted earlier.  So the error is sort
of a red herring, since the shadow itself is likely working fine.  The
way I tracked it down was to find out which machine the job was running
on, find the job's directory on that machine,
\\machine\c$\condor\execute\somedir (if you have access to that drive),
and check the .log/.out/.err files there to see if one of the
executables complained about not finding a file.  There's probably a
command to retrieve those logs automatically, but give that a shot.
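
	If you'd rather not open each file by hand, the check above could be
scripted roughly like this.  It's only a sketch, not anything that ships
with Condor: the function name and the list of phrases to look for are
my own assumptions, so adjust the phrases to whatever your executables
actually print when an input file is missing.

```python
from pathlib import Path

# Assumed phrases; these are guesses at what a program might print
# when it cannot find an input file. Edit to match your executables.
PATTERNS = ("cannot find", "can't find", "not found",
            "no such file", "cannot open")
SUFFIXES = {".log", ".out", ".err"}


def find_missing_file_complaints(root):
    """Return (path, line number, line) for every suspicious line
    in the .log/.out/.err files found under root."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in SUFFIXES:
            continue
        try:
            text = path.read_text(errors="replace")
        except OSError:
            # File may be locked or already cleaned up; skip it.
            continue
        for lineno, line in enumerate(text.splitlines(), start=1):
            lowered = line.lower()
            if any(pattern in lowered for pattern in PATTERNS):
                hits.append((str(path), lineno, line.strip()))
    return hits
```

You would point it at the execute directory on the machine that ran the
job, e.g. find_missing_file_complaints(r"\\machine\c$\condor\execute"),
and print whatever comes back.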

	-Zack

-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Dodge, Richard
Sent: Wednesday, March 30, 2005 12:21 PM
To: Condor-Users Mail List
Subject: [Condor-users] Shadow exceptions on Window Machines

What are shadow exceptions and what can I do to avoid them?

When submitting jobs to Windows XP machines, I repeatedly get "shadow
exceptions" which cause the job to rerun itself, never giving any
results.  This seems to occur on jobs with run length of at least 8
hours.  Shorter running jobs on these same machines run fine.  When
running jobs on Linux machines, jobs that run > 48 hours never have
"shadow exceptions".

Here is a typical job.log from a job having "shadow exception" problem:


000 (3387.000.000) 03/23 15:37:59 Job submitted from host:
<172.16.230.242:32775>
...
001 (3387.000.000) 03/23 17:20:35 Job executing on host:
<172.16.204.38:1047>
...
006 (3387.000.000) 03/23 17:20:43 Image size of job updated: 65836
...
006 (3387.000.000) 03/23 17:40:46 Image size of job updated: 252316
...
007 (3387.000.000) 03/24 03:13:43 Shadow exception!
	Can no longer talk to condor_starter on execute machine
(172.16.204.38)
	0  -  Run Bytes Sent By Job
	6627422  -  Run Bytes Received By Job
...
001 (3387.000.000) 03/24 03:13:47 Job executing on host:
<172.16.204.38:1047>
...
006 (3387.000.000) 03/24 03:33:55 Image size of job updated: 252316
...
004 (3387.000.000) 03/24 07:33:32 Job was evicted.
	(0) Job was not checkpointed.
		Usr 0 04:17:51, Sys 0 00:00:00  -  Run Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
	0  -  Run Bytes Sent By Job
	6627422  -  Run Bytes Received By Job
...
009 (3387.000.000) 03/24 07:33:32 Job was aborted by the user.
	via condor_rm (by user psmd)
...
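
For longer logs, the shadow exceptions can be picked out mechanically.
The following is a minimal sketch that assumes only the layout visible
in the excerpt above: events separated by "..." lines, each starting
with a three-digit event number, the job id in parentheses, and a
date/time, with event 007 being the "Shadow exception!" entry.  The
function names are hypothetical, not part of Condor.

```python
import re

# Header layout taken from the excerpt above, e.g.
# "007 (3387.000.000) 03/24 03:13:43 Shadow exception!"
HEADER = re.compile(
    r"^(?P<code>\d{3}) \((?P<job>\d+\.\d+\.\d+)\) "
    r"(?P<when>\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (?P<text>.*)$"
)


def parse_events(log_text):
    """Split a job log into events; each event is a dict with the
    header fields plus a 'details' list of the remaining lines."""
    events, block = [], []
    # Append a final separator so a trailing event is flushed too.
    for line in log_text.splitlines() + ["..."]:
        if line.strip() != "...":
            if line.strip():
                block.append(line.strip())
            continue
        if block:
            match = HEADER.match(block[0])
            if match:
                event = match.groupdict()
                event["details"] = block[1:]
                events.append(event)
        block = []
    return events


def shadow_exceptions(log_text):
    """Return only the events numbered 007 (shadow exceptions)."""
    return [e for e in parse_events(log_text) if e["code"] == "007"]
```

Calling shadow_exceptions(open("job.log").read()) gives one entry per
shadow exception, with the timestamp in "when" and the reason text in
"details", which makes it easier to see how long each run lasted before
the connection to the condor_starter was lost.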

Thanks,

Richard Dodge


-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Colin Gillespie
Sent: Monday, February 14, 2005 4:33 AM
To: Condor-Users Mail List
Subject: [Condor-users] Shadow exceptions


Dear All,

My simple DAG gave error code 7. I know that this is a shadow
exception, but what does this mean? The program seemed to finish
correctly.

Also, I have a POST script that monitors the signals. Basically, if an
error occurs then an error is placed in a database. But in this case the
job did finish correctly. What's the best procedure for error monitoring?

Thanks

Colin

condor_script2381Dag.dag.dagman.out 
<snip>
2/12 10:47:42 Event: ULOG_JOB_TERMINATED for Condor Job
condor_script2381 (5441.0)
2/12 10:47:42 Job condor_script2381 failed with signal 7.
2/12 10:47:42 Running POST script of Job condor_script2381...
2/12 10:47:42 Of 1 nodes total:
2/12 10:47:42  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
2/12 10:47:42   ===     ===      ===     ===     ===        ===      ===
2/12 10:47:42     0       0        0       1       0          0        0
2/12 10:47:47 Event: ULOG_POST_SCRIPT_TERMINATED for Condor Job
condor_script2381 (5441.0)
2/12 10:47:47 POST Script of Job condor_script2381 completed
successfully.
2/12 10:47:47 Of 1 nodes total:
2/12 10:47:47  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
2/12 10:47:47   ===     ===      ===     ===     ===        ===      ===
2/12 10:47:47     1       0        0       0       0          0        0
2/12 10:47:47 All jobs Completed!
2/12 10:47:47 **** condor_scheduniv_exec.5440.0 (condor_DAGMAN) EXITING
WITH STATUS 0


_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/condor-users


