
Re: [Condor-users] input file locking?



I also suspect that it's not a Condor problem, per se, but maybe an NFS file lock issue of some sort. I was wondering if Condor locks the input file when reading it. Making a copy of the input files would probably work, and I may ask him to test that, but it's not a viable long-term solution, considering the thousands of runs that some researchers do.

- dave


Perez, Jerry wrote:
Hi,
I did something similar with my campus grid running BLAST jobs. One solution may be to replicate and rename the input file: input1.txt, input2.txt, etc. You can write a one-line shell script that does this for you. Try this with a subset of the runs and see if your problem goes away. I agree that this may be more of an OS/file system problem than a Condor problem. I hope this helps,
Jerry Perez
Texas Tech University
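
The replicate-and-rename idea above can be sketched in a few lines. This is only an illustration under assumptions: the function name replicate_input and its parameters are hypothetical, and the input1.txt, input2.txt naming simply follows the example in the message. It gives each run its own copy of the input file, which should help show whether sharing one file is what triggers the failures.

```python
import shutil
from pathlib import Path


def replicate_input(source, dest_dir, count, stem="input", suffix=".txt"):
    """Copy `source` to dest_dir/input1.txt ... input<count>.txt.

    Hypothetical helper: the names and the numbering scheme are
    assumptions taken from the example in the message above.
    Returns the list of paths that were written.
    """
    source = Path(source)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    copies = []
    for i in range(1, count + 1):
        target = dest_dir / f"{stem}{i}{suffix}"
        shutil.copyfile(source, target)  # copies file contents only
        copies.append(target)
    return copies
```

As suggested above, it is probably worth trying this on a small subset of runs first, since copying the inputs for thousands of runs may not be practical.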

________________________________

From: condor-users-bounces@xxxxxxxxxxx on behalf of David A. Kotz
Sent: Fri 11/10/2006 3:34 PM
To: Condor-Users Mail List
Subject: Re: [Condor-users] input file locking?



As I said, I've already checked for accessibility of the file from the
execute node, and I've checked daemon logs for any signs of NFS trouble.
  The same executable is being used for all runs, successful and
unsuccessful, and the same input file has been used both successfully
and unsuccessfully.  The submit description queues up about 300 runs of
the same program, which is doing some evolutionary simulations.

- dave


Junjun Mao wrote:
Most likely this is not Condor-related, since Condor had already started
the job. Try running the program on the node where it failed to see
whether the user gets the same error. Then you may want to check whether
NFS is unstable.

Junjun
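
One way to act on the suggestion above is to run a small check directly on the execute node. The sketch below is an assumption-laden illustration, not a Condor tool: check_readable is a hypothetical helper that tries to open every file in a directory several times and collects any operating-system errors. It runs as whoever invokes it, so it only shows whether that account can read the files over NFS at that moment.

```python
from pathlib import Path


def check_readable(directory, attempts=3):
    """Try to open each regular file in `directory` `attempts` times.

    Hypothetical diagnostic helper. Returns a dict mapping the path of
    each file that failed at least once to the list of error messages.
    An empty dict means every open and read succeeded.
    """
    failures = {}
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue  # skip subdirectories and other non-file entries
        for _ in range(attempts):
            try:
                with open(path, "rb") as handle:
                    handle.read(1)  # force an actual read, not just an open
            except OSError as exc:
                failures.setdefault(str(path), []).append(str(exc))
    return failures
```

For example, calling check_readable("inputs") from the job's working directory on the failing node would cover files such as inputs/in.186. An empty result does not rule out an intermittent NFS problem; it only means the files were readable during that check.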

On Friday 10 November 2006 16:18, David A. Kotz wrote:
When Condor opens an input file for a job, does it lock that file?  I
have a user who is submitting hundreds of jobs, all of which refer to
a directory (NFS mounted) of text files with one number in each.  At
any given time, there may be several jobs using the same input file.
Some of the jobs using a given input file run to completion with no
problems while others repeatedly fail to run with errors like the
following in the shadow log:

11/10 15:10:27 (3744.186) (9335):error: Error: Couldn't open standard
file 'inputs/in.186'

I've checked the system logs to make sure we aren't having
intermittent automounter issues or any other system failings.  The
jobs that fail to run keep failing to run, returning to the idle
state over and over, even after all of the running jobs have
completed.
_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx
with a subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at either
https://lists.cs.wisc.edu/archive/condor-users/
http://www.opencondor.org/spaces/viewmailarchive.action?key=CONDOR
