
Re: [Condor-users] New NFS warning with condor 6.8.1



Nick,

> This isn't a new problem, just a new warning about an old problem.
> 
> File locking on NFS is inherently unreliable.  We've seen enough
> cases of NFS based job logs getting corrupted (from multiple
> processes updating the log file) that we decided to add the warning.
> I suspect that the risk of such corruption is reduced if all writers
> are on the same machine, possibly even eliminated, but I don't know
> for certain.  In particular, corrupted job logs tend to make DAGMan
> very unhappy.

I've seen this "unreliable" claim before and it's not true.  NFS file
locking is fine.  Dealing with cache coherency tends to be a problem.

> Ultimately, we'd like to implement a more advanced locking mechanism
> (using a separate lock file), but we haven't had time to add this yet.

This will do nothing to help with cache coherency.  In fact using a
separate lock file has its own issues as far as how you create it and
how you test for its existence.

This is how you can do reliable NFS file sharing for the case where
you are simply appending to the end of a file:


   // Open the file.
   if ((fd = open("file", O_RDWR)) == -1)
      ABORT

   // Lock the file.
   if (lockf(fd, F_LOCK, 0) == -1)
      ABORT

   // Now the tricky part, we cached the file size at open(), but we
   // need to know how large it is now.  Force a cache update.
   if (fchmod(fd, 0644) == -1)
      ABORT

   // Now find the end of the file.
   if (lseek(fd, 0, SEEK_END) == (off_t)-1)
      ABORT

   // Now write() and close() as usual.


Note: in theory any function that requires a round trip to the NFS
      server can be used in place of "fchmod()", but "fchmod()" is
      what I've always used and I know it works.

If you are wondering, I have used this method for a combination of
SunOS and Linux boxes, with over 40 machines accessing the same file
at the same time, and with both Linux and EMC NFS servers, and have
never had a problem.  I can't guarantee anything, but my experience
and web research indicate that this method should work in general.

-- 
Daniel K. Forrest	Laboratory for Molecular and
forrest@xxxxxxxxxxxxx	Computational Genomics
(608) 262 - 9479	University of Wisconsin, Madison