
Re: [Condor-users] DAG Windows Problem : Error: Unable to monitor node job log file



On Tue, 20 Jul 2010, Sassy Natan wrote:

One initial question:  what version of Condor are you running?

Does someone maybe know how to overcome this?

I have a simple DAG file that looks like this:

JOB  A  A.job
JOB  B  B.job
PARENT A CHILD B

Job A and Job B can run on the Windows Condor Cluster without any problem.

Here is what A.job looks like:

universe = vanilla
transfer_files=always

This looks like you are getting should_transfer_files and when_to_transfer_output confused. I think you want:

  should_transfer_files = YES
  when_to_transfer_output = ON_EXIT_OR_EVICT

I don't think this is the cause of the DAGMan problem, but you might as well fix it (a corrected version of A.job is sketched below, right after your quoted submit file)...

requirements =
executable = U:\runA.bat
Arguments  =
output =A.out
log = A.log
error = A.err
notification = Error
initialdir = U:
run_as_owner = True
load_profile = True
queue 4
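For reference, here is roughly what A.job looks like with that change folded in. This is just a sketch based on the file you posted (with transfer_files=always replaced by the two settings above), not something I've tested on your setup:

  universe = vanilla
  should_transfer_files = YES
  when_to_transfer_output = ON_EXIT_OR_EVICT
  requirements =
  executable = U:\runA.bat
  Arguments  =
  output = A.out
  log = A.log
  error = A.err
  notification = Error
  initialdir = U:
  run_as_owner = True
  load_profile = True
  queue 4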

Now when running the DAG using condor_submit_dag.exe DAG.job I get the
following error:



7/20/10 10:20:25 WARNING: ProcessId not confirmed unique

You can ignore this warning.

07/20/10 10:20:25 Bootstrapping...
07/20/10 10:20:25 Number of pre-completed nodes: 0
07/20/10 10:20:25 Registering condor_event_timer...
07/20/10 10:20:26 Sleeping for one second for log file consistency
07/20/10 10:20:27 DAGMan::Job:8001:ERROR: Unable to monitor log file for
node A|ReadMultipleUserLogs:9004:Error getting file ID in
monitorLogFile()|ReadMultipleUserLogs:9004:Error initializing log file
U:\A.log|MultiLogFiles:9001:Error (2, No such file or directory) opening
file U:\A.log for creation or truncation

This is the real problem.

07/20/10 10:20:27 Of 2 nodes total:
07/20/10 10:20:27  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
07/20/10 10:20:27   ===     ===      ===     ===     ===        ===      ===
07/20/10 10:20:27     0       0        0       0       0          2        0
07/20/10 10:20:27 ERROR: a cycle exists in the DAG

DAGMan just thinks a cycle exists because of the previous error.

07/20/10 10:20:27 ---------------------- Job ----------------------
07/20/10 10:20:27       Node Name: A
07/20/10 10:20:27            Noop: false
07/20/10 10:20:27          NodeID: 0
07/20/10 10:20:27     Node Status: STATUS_ERROR
07/20/10 10:20:27 Node return val: -1003
07/20/10 10:20:27           Error: Unable to monitor node job log file
07/20/10 10:20:27 Job Submit File: A.job
07/20/10 10:20:27   Condor Job ID: [not yet submitted]
07/20/10 10:20:27       Q_PARENTS: <END>
07/20/10 10:20:27       Q_WAITING: <END>
07/20/10 10:20:27      Q_CHILDREN: B, <END>
07/20/10 10:20:27 ---------------------- Job ----------------------
07/20/10 10:20:27       Node Name: B
07/20/10 10:20:27            Noop: false
07/20/10 10:20:27          NodeID: 1
07/20/10 10:20:27     Node Status: STATUS_READY
07/20/10 10:20:27 Node return val: -1
07/20/10 10:20:27 Job Submit File: B.job
07/20/10 10:20:27   Condor Job ID: [not yet submitted]
07/20/10 10:20:27       Q_PARENTS: A, <END>
07/20/10 10:20:27       Q_WAITING: A, <END>
07/20/10 10:20:27      Q_CHILDREN: <END>
07/20/10 10:20:27 --------------------------------------- <END>
07/20/10 10:20:27 Aborting DAG...
07/20/10 10:20:27 Writing Rescue DAG to dag.dag.rescue001...
07/20/10 10:20:27 Note: 0 total job deferrals because of -MaxJobs limit (0)
07/20/10 10:20:27 Note: 0 total job deferrals because of -MaxIdle limit (0)
07/20/10 10:20:27 Note: 0 total job deferrals because of node category
throttles
07/20/10 10:20:27 Note: 0 total PRE script deferrals because of -MaxPre
limit (0)
07/20/10 10:20:27 Note: 0 total POST script deferrals because of -MaxPost
limit (0)


I found this: https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=831

But it doesn't say much. Can someone please drop a comment on this?
This job is part of a Hadoop cluster that I'm trying to build.

Here are a couple of things to try, just to help diagnose the problem:

1) Create the log files for your jobs before you start the DAG (there's a quick command-prompt sketch of this after these two suggestions). You shouldn't have to do this, but given the error message I'd like to see whether things work if you do it. You can just create zero-size files, or whatever is easiest.

2) Try removing the initialdir specification in the submit files, and just submit the DAG from the U: directory. I don't think this will make any difference, but it would be interesting to find out for sure.
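To make suggestion 1) concrete, here is a rough sketch of what I mean from a Windows command prompt. I'm assuming the U: paths from your submit files and guessing that B.job logs to B.log (you didn't post B.job, so adjust the name); creating the empty files from Explorer or a batch file is just as good:

  cd /d U:\
  type nul > A.log
  type nul > B.log
  condor_submit_dag.exe DAG.job

Running condor_submit_dag from U: like that also covers the "submit the DAG from the U: directory" part of suggestion 2); removing initialdir is a separate edit to the submit files themselves.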

Kent Wenger
Condor Team