
Re: [Condor-users] DAGman failed to detect a node's status, seems because it could not read its log.



Hi, Wenger. Thanks for your response.

I do run NFS, but I do not export the directory where my jobs run, so I don't think my DAG failures are caused by NFS. I also disabled the NFS service and enabled the NFS support option in Condor's global configuration, but that did not help.

DAGMan only started working once I made all the nodes use the same log file.
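
For anyone who wants to check their own setup, here is a minimal sketch in Python of how one might verify that every node in a DAG points at the same log file. It assumes the DAG file declares nodes with `JOB <name> <submit-file>` lines and that each submit file sets its log with a `log = <path>` line. The function names are hypothetical helpers written for this example and are not part of Condor. Log paths are compared as plain strings, so two different spellings of the same file would count as different.

```python
import re
from pathlib import Path

# Matches a submit-file line such as "log = mydag.log" (keyword is case-insensitive).
LOG_LINE = re.compile(r"^\s*log\s*=\s*(\S.*?)\s*$", re.IGNORECASE)


def submit_files_in_dag(dag_text):
    """Return {node_name: submit_file} for each JOB line in the DAG text."""
    nodes = {}
    for line in dag_text.splitlines():
        parts = line.split()
        # Expected shape: JOB <node-name> <submit-file> [options...]
        if len(parts) >= 3 and parts[0].upper() == "JOB":
            nodes[parts[1]] = parts[2]
    return nodes


def log_path_in_submit(submit_text):
    """Return the value of the last 'log = ...' line, or None if there is none."""
    found = None
    for line in submit_text.splitlines():
        match = LOG_LINE.match(line)
        if match:
            found = match.group(1)
    return found


def logs_by_node(dag_path):
    """Read a DAG file and return {node_name: log_path_or_None}."""
    dag_path = Path(dag_path)
    result = {}
    for node, submit in submit_files_in_dag(dag_path.read_text()).items():
        # Submit files are resolved relative to the DAG file's directory.
        submit_path = dag_path.parent / submit
        result[node] = log_path_in_submit(submit_path.read_text())
    return result


def shares_single_log(node_logs):
    """True only if every node names a log and all of them name the same one."""
    values = set(node_logs.values())
    return len(values) == 1 and None not in values
```

Calling `shares_single_log(logs_by_node("my.dag"))`, where `my.dag` is a placeholder name, would return True only for a DAG set up the way that finally worked for me.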

--"When you upgraded the DAGMan version, did you re-run the DAG from scratch, or did you run it in recovery mode?"
No, I didn't. Until it worked, I ran my jobs one node at a time. But I kept the .dagman.out files from the failed runs.

The attachments are the failure logs. I hope they help you experts debug DAGMan.

My system is Ubuntu 9.04 x86_64, and I installed the RHEL 5 x86_64 version of Condor.

Condor is a great project, thanks.

Xiaowei


2009/12/22 R. Kent Wenger <wenger@xxxxxxxxxxx>
On Mon, 21 Dec 2009, dawnsong wrote:

It is fixed at last. I set all the nodes to share the same log file.

As the Condor 7.3 manual says, DAGMan supports separate logs for separate nodes,
but it seems that having all nodes share one log lets DAGMan run
without complaining about "ERROR: failure to read job log".

This confused me since I have already upgraded to 7.4.

From your earlier email, it sounds like your log file(s) are on NFS; is
that correct?  If so, that's most likely the source of your problems.

When you upgraded the DAGMan version, did you re-run the DAG from scratch, or did you run it in recovery mode?  Do you still have the log file that generated the error?  If so, I'd like to take a look at it.  I'm guessing that if the file was on NFS, you got corrupted events because of two events being written at the same time.

Kent Wenger
Condor Team
_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/



--
Xiao-Wei Song
Ping Zhu's Lab, Center for Structural and Molecular Biology
Institute of Biophysics, Chinese Academy of Sciences
15 Datun Road, Chaoyang District, Beijing, China 100101
Tel:  +86-10-64888353, E-mail: dawnsong@xxxxxxxxxxxxxx

Attachment: condor.dagman.out.zip
Description: Zip archive