
[Condor-users] DAG Windows Problem : Error: Unable to monitor node job log file



Hi All,
 
Does anyone know how to overcome this?
 
I have a simple DAG file that looks like this:
 
JOB  A  A.job
JOB  B  B.job
PARENT A CHILD B
 
Job A and Job B each run on the Windows Condor cluster without any problem when submitted on their own.
 
Here is what A.job looks like:
 
universe = vanilla
transfer_files=always
requirements =
executable = U:\runA.bat
Arguments  =
output =A.out
log = A.log
error = A.err
notification = Error
initialdir = U:
run_as_owner = True
load_profile = True
queue 4
 
 
Now, when I run the DAG using condor_submit_dag.exe DAG.job, I get the following error:
 
 
 
07/20/10 10:20:25 WARNING: ProcessId not confirmed unique
07/20/10 10:20:25 Bootstrapping...
07/20/10 10:20:25 Number of pre-completed nodes: 0
07/20/10 10:20:25 Registering condor_event_timer...
07/20/10 10:20:26 Sleeping for one second for log file consistency
07/20/10 10:20:27 DAGMan::Job:8001:ERROR: Unable to monitor log file for node A|ReadMultipleUserLogs:9004:Error getting file ID in monitorLogFile()|ReadMultipleUserLogs:9004:Error initializing log file U:\A.log|MultiLogFiles:9001:Error (2, No such file or directory) opening file U:\A.log for creation or truncation
07/20/10 10:20:27 Of 2 nodes total:
07/20/10 10:20:27  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
07/20/10 10:20:27   ===     ===      ===     ===     ===        ===      ===
07/20/10 10:20:27     0       0        0       0       0          2        0
07/20/10 10:20:27 ERROR: a cycle exists in the DAG
07/20/10 10:20:27 ---------------------- Job ----------------------
07/20/10 10:20:27       Node Name: A
07/20/10 10:20:27            Noop: false
07/20/10 10:20:27          NodeID: 0
07/20/10 10:20:27     Node Status: STATUS_ERROR   
07/20/10 10:20:27 Node return val: -1003
07/20/10 10:20:27           Error: Unable to monitor node job log file
07/20/10 10:20:27 Job Submit File: A.job
07/20/10 10:20:27   Condor Job ID: [not yet submitted]
07/20/10 10:20:27       Q_PARENTS: <END>
07/20/10 10:20:27       Q_WAITING: <END>
07/20/10 10:20:27      Q_CHILDREN: B, <END>
07/20/10 10:20:27 ---------------------- Job ----------------------
07/20/10 10:20:27       Node Name: B
07/20/10 10:20:27            Noop: false
07/20/10 10:20:27          NodeID: 1
07/20/10 10:20:27     Node Status: STATUS_READY   
07/20/10 10:20:27 Node return val: -1
07/20/10 10:20:27 Job Submit File: B.job
07/20/10 10:20:27   Condor Job ID: [not yet submitted]
07/20/10 10:20:27       Q_PARENTS: A, <END>
07/20/10 10:20:27       Q_WAITING: A, <END>
07/20/10 10:20:27      Q_CHILDREN: <END>
07/20/10 10:20:27 --------------------------------------- <END>
07/20/10 10:20:27 Aborting DAG...
07/20/10 10:20:27 Writing Rescue DAG to dag.dag.rescue001...
07/20/10 10:20:27 Note: 0 total job deferrals because of -MaxJobs limit (0)
07/20/10 10:20:27 Note: 0 total job deferrals because of -MaxIdle limit (0)
07/20/10 10:20:27 Note: 0 total job deferrals because of node category throttles
07/20/10 10:20:27 Note: 0 total PRE script deferrals because of -MaxPre limit (0)
07/20/10 10:20:27 Note: 0 total POST script deferrals because of -MaxPost limit (0)
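 
The step that fails in the log above is opening U:\A.log "for creation or truncation". As a rough way to narrow this down, here is a minimal sketch, assuming Python is available on the submit machine, that only checks whether a file can be created in a given directory. The function name and the probe file name are hypothetical and are not part of Condor, and the result only reflects the account and logon session the script is run under, which may differ from the one DAGMan runs in.

```python
import os


def can_create_file_in(directory, name="dagman_probe.log"):
    """Try to create (or truncate) a probe file in `directory`.

    Returns (True, None) on success, or (False, message) on failure.
    The probe file is removed again so no real log is touched.
    """
    probe = os.path.join(directory, name)
    try:
        # Mode "w" creates the file or truncates an existing one,
        # similar in spirit to the operation named in the error message.
        with open(probe, "w"):
            pass
    except OSError as exc:
        return False, "errno %s: %s" % (exc.errno, exc.strerror)
    os.remove(probe)
    return True, None


# Example call (path taken from the error message above):
#   ok, err = can_create_file_in("U:\\")
#   print("can create files" if ok else "failed: " + err)
```

If the check succeeds interactively but DAGMan still reports the error, the difference is more likely in how the drive is seen by the process than in the submit file itself.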
 
 
I found this ticket: https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=831
 
But it doesn't say much. Can someone please drop a comment on this?
This job is part of a Hadoop cluster that I'm trying to build.
 
Thank you
 
Sassy