
Re: [Condor-users] input file locking?



The ulimit idea seemed intriguing at first, but it wouldn't explain why some of the jobs, say 100 of the original 300, never start successfully. I could see 300 jobs causing the user to hit the soft limit of 1024 pretty easily. But once the 200 that do run have finished, I would expect the remaining 100 to run on the next negotiation cycle; instead, they sit in the queue for days.

- dave


Junjun Mao wrote:
In that case, the error occurs randomly on different machines. One place to look is the maximum number of open file handles allowed. I don't know if "ulimit -a" returns this value. Increasing this number will allow more files to be open at the same time.
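
For anyone checking the limit mentioned above: in bash, "ulimit -n" prints the soft limit on open file descriptors, and "ulimit -a" lists it as "open files". The same values can be read from inside a process. The following is only a rough sketch, not anything Condor provides; the function names are made up for illustration, and the descriptor count assumes a Linux-style /proc filesystem.

```python
import os
import resource


def open_file_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def count_open_fds():
    """Count descriptors currently open in this process.

    Assumption: a Linux-style /proc filesystem. Returns None elsewhere.
    """
    fd_dir = "/proc/self/fd"
    if not os.path.isdir(fd_dir):
        return None
    return len(os.listdir(fd_dir))


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print("soft limit:", soft)
    print("hard limit:", hard)
    print("currently open:", count_open_fds())
```

Note that this limit applies per process, so it should be run as the same user, and on the same machine, as the process suspected of running out of descriptors.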

On Friday 10 November 2006 16:40, David A. Kotz wrote:
As I said, I've already checked for accessibility of the file from
the execute node, and I've checked daemon logs for any signs of NFS
trouble. The same executable is being used for all runs, successful
and unsuccessful, and the same input file has been used both
successfully and unsuccessfully.  The submit description queues up
about 300 runs of the same program, which is doing some evolutionary
simulations.

- dave

Junjun Mao wrote:
Most likely this is not Condor-related, as the job was already
started by Condor. Try running the program on the node where it
failed to see if it gets the same error. Then you may want to check
whether NFS is unstable.

Junjun
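
One way to carry out the check suggested above is to open the input file repeatedly on the failing node and record the errno for each failure. This is a minimal sketch under assumptions: probe_open is a hypothetical helper, not part of Condor, and the path is taken from the shadow log message quoted in this thread, so it should be adjusted to wherever the job's inputs directory is actually mounted.

```python
import errno
import time


def probe_open(path, attempts=5, delay=0.0):
    """Try to open and read `path` several times.

    Returns a list of (attempt_index, errno_name, message) tuples,
    one per failed attempt. An empty list means every attempt succeeded.
    """
    failures = []
    for i in range(attempts):
        try:
            with open(path, "r") as fh:
                fh.read(1)
        except OSError as exc:
            name = errno.errorcode.get(exc.errno, str(exc.errno))
            failures.append((i, name, exc.strerror))
        if delay:
            time.sleep(delay)
    return failures


if __name__ == "__main__":
    # Path taken from the shadow log message; adjust for the real mount point.
    for attempt, name, message in probe_open("inputs/in.186", attempts=10, delay=1.0):
        print(attempt, name, message)
```

The errno name helps tell the cases apart: for example, EMFILE means the process has hit its open-file limit, while ENOENT or ESTALE would point toward the file or the NFS mount.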

On Friday 10 November 2006 16:18, David A. Kotz wrote:
When Condor opens an input file for a job, does it lock that file?
 I have a user who is submitting hundreds of jobs, all of which
refer to a directory (NFS mounted) of text files with one number
in each.  At any given time, there may be several jobs using the
same input file. Some of the jobs using a given input file run to
completion with no problems while others repeatedly fail to run
with errors like the following in the shadow log:

11/10 15:10:27 (3744.186) (9335):error: Error: Couldn't open
standard file 'inputs/in.186'

I've checked the system logs to make sure we aren't having
intermittent automounter issues or any other system failings.  The
jobs that fail to run keep failing to run, returning to the idle
state over and over, even after all of the running jobs have
completed.
_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx
with a subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at either
https://lists.cs.wisc.edu/archive/condor-users/
http://www.opencondor.org/spaces/viewmailarchive.action?key=CONDOR