
RE: [Condor-users] Dagman and rescue files



Colin,

Whether a DAG fails is complicated by PRE and POST scripts.
These are critical tools and in many ways override the results
from the node itself. Read the manual carefully here.

But in short, if there is a POST script, even if the node job fails,
a 0 return from the POST script means the node succeeded. Similarly,
even if the node job ran and returned 0, a non-zero return from the
POST script would be considered a failure and a rescue DAG would be written.
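
To make that rule concrete, here is a small Python model of it. This is
only an illustration of the rule as stated in this message, not DAGMan's
actual code; the function name node_succeeded is made up for the example,
and PRE scripts are not modelled.

```python
def node_succeeded(job_exit_code, post_exit_code=None):
    """Illustrative model: with a POST script, only its exit code counts.

    post_exit_code is None when the node has no POST script.
    """
    if post_exit_code is not None:
        # The POST script's return value overrides the job's result.
        return post_exit_code == 0
    # No POST script: the job's own exit code decides.
    return job_exit_code == 0
```

So a failed job (exit code 1) with a POST script returning 0 counts as
success, and a successful job with a POST script returning 1 counts as
failure.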

I suspect your post-script returns a non-zero value.
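
As a rough sketch of what I mean (this is not your data2db.py, and
record_result is a hypothetical placeholder for whatever the script
really does), a POST script in Python can make its return value explicit
like this. The two arguments are assumed to arrive in the order given on
your SCRIPT POST line: $RETURN, then $JOB.

```python
import sys


def record_result(node_name, job_return):
    # Hypothetical placeholder: write to a file, load a database, etc.
    pass


def main(argv):
    # argv[1] is $RETURN (the job's exit code), argv[2] is $JOB (the node name).
    if len(argv) != 3:
        sys.stderr.write("usage: data2db.py RETURN JOB\n")
        return 2
    try:
        record_result(argv[2], int(argv[1]))
    except Exception as exc:
        # Any problem becomes a non-zero exit, which marks the node as failed.
        sys.stderr.write("POST script failed: %s\n" % exc)
        return 1
    # Explicit 0: this is what tells DAGMan the node succeeded.
    return 0


# In the real script, finish with:
#     sys.exit(main(sys.argv))
```

Running the script by hand with the same two arguments and then checking
its exit status in the shell should show what DAGMan is seeing.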

Bill
Condor Team 

> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx 
> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Colin Gillespie
> Sent: Tuesday, September 14, 2004 5:02 AM
> To: condor-users@xxxxxxxxxxx
> Subject: [Condor-users] Dagman and rescue files
> 
> Dear All,
> 
> I have a very simple Dag
> 
> Job condor_script1 /home/condor_script1.sub Script POST 
> condor_script1 /home/data2db.py $RETURN $JOB
> 
> but always creates a rescue file. The script currently does 
> something simple, e.g. nothing or write to file. When the 
> script writes to file, it does produce the desired output.
> 
> Can anyone suggest why a rescue file is being created?
> 
> Many thanks 
> 
> Colin
> 
> The rescue file is:
> # Rescue DAG file, created after running
> #   the /home/condor_scriptDag1.dag DAG file
> #
> # Total number of Nodes: 1
> # Nodes premarked DONE: 0
> # Nodes that failed: 0
> #   <ENDLIST>
> 
> JOB condor_script1 /home/condor_script1.sub SCRIPT POST 
> condor_script1 /home/data2db.py $RETURN $JOB
> 
> 
> The dagman out file is:
> <snip>
> 9/14 10:47:46 Job condor_script1 completed successfully.
> 9/14 10:47:46 Running POST script of Job condor_script1...
> 9/14 10:47:46 Of 1 nodes total:
> 9/14 10:47:46  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
> 9/14 10:47:46   ===     ===      ===     ===     ===        ===      ===
> 9/14 10:47:46     0       0        0       1       0          0        0
> 9/14 10:47:46 UserLog::initialize: open("") failed - errno 2 (No such file or directory)
> 9/14 10:47:51 Of 1 nodes total:
> 9/14 10:47:51  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
> 9/14 10:47:51   ===     ===      ===     ===     ===        ===      ===
> 9/14 10:47:51     0       0        0       0       0          1        0
> 9/14 10:47:51 ERROR: a cycle exists in the DAG
> 9/14 10:47:51 Aborting DAG...
> 9/14 10:47:51 Writing Rescue DAG to /home1/ncsg3/basis/simulator/condor_scriptDag1.dag.rescue...
> 9/14 10:47:51 **** condor_scheduniv_exec.738.0 (condor_DAGMAN) EXITING WITH STATUS 1