
Re: [Condor-users] input file locking?



In that case, the error is occurring at random on different machines. One
place to look is the maximum number of open file handles allowed; "ulimit -n"
(or the "open files" line in the output of "ulimit -a") should show the
per-process limit. Increasing this number will allow more files to be open
at the same time.
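A minimal sketch in Python, assuming a Unix-like host where the standard
`resource` module is available, of how a process can read the same
open-file limit that `ulimit -n` reports and raise its soft limit toward the
hard limit. The helper names and the target of 4096 are hypothetical
examples. Raising the hard limit itself, or changing the limit that Condor's
own daemons run with, needs administrator configuration not shown here.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) open-file-descriptor limits of this process.

    These correspond to what `ulimit -Sn` and `ulimit -Hn` print in a shell.
    """
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_limit(target=4096):
    """Raise the soft open-file limit to `target`, capped at the hard limit.

    An unprivileged process may raise its soft limit up to, but not past,
    the hard limit. The limit is never lowered. `target` is an arbitrary
    example value. Returns the (soft, hard) limits after the call.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft, hard  # already unlimited; nothing to raise
    new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
    if new_soft > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print(f"open files: soft={soft} hard={hard}")
    print("after raise_soft_limit():", raise_soft_limit())
```

Only the soft limit is touched because that is the one an ordinary user can
change; a limit set this way applies to the calling process and its children,
not to processes that are already running.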

On Friday 10 November 2006 16:40, David A. Kotz wrote:
> As I said, I've already checked for accessibility of the file from
> the execute node, and I've checked daemon logs for any signs of NFS
> trouble. The same executable is being used for all runs, successful
> and unsuccessful, and the same input file has been used both
> successfully and unsuccessfully.  The submit description queues up
> about 300 runs of the same program, which is doing some evolutionary
> simulations.
>
> - dave
>
> Junjun Mao wrote:
> > Most likely this is not Condor-related, since Condor had already
> > started the job. Try running the program by hand on the node where it
> > failed to see if you get the same error. Then you may want to check
> > whether NFS is unstable.
> >
> > Junjun
> >
> > On Friday 10 November 2006 16:18, David A. Kotz wrote:
> >> When Condor opens an input file for a job, does it lock that file?
> >>  I have a user who is submitting hundreds of jobs, all of which
> >> refer to a directory (NFS mounted) of text files with one number
> >> in each.  At any given time, there may be several jobs using the
> >> same input file. Some of the jobs using a given input file run to
> >> completion with no problems while others repeatedly fail to run
> >> with errors like the following in the shadow log:
> >>
> >> 11/10 15:10:27 (3744.186) (9335):error: Error: Couldn't open
> >> standard file 'inputs/in.186'
> >>
> >> I've checked the system logs to make sure we aren't having
> >> intermittent automounter issues or any other system failings.  The
> >> jobs that fail to run keep failing to run, returning to the idle
> >> state over and over, even after all of the running jobs have
> >> completed.
> >> _______________________________________________
> >> Condor-users mailing list
> >> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx
> >> with a subject: Unsubscribe
> >> You can also unsubscribe by visiting
> >> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
> >>
> >> The archives can be found at either
> >> https://lists.cs.wisc.edu/archive/condor-users/
> >> http://www.opencondor.org/spaces/viewmailarchive.action?key=CONDOR

--
Dr. Junjun Mao, Research Associate
Steinman Hall, #1M-11
Levich Institute at City College of CUNY
140th Street & Convent Avenue
New York, NY 10031
(212) 650-6845 (Phone) 
(212) 650-6835 (fax)