
Re: [Condor-users] input file locking?



Hi,
 
I did something similar on my campus grid running BLAST jobs. One solution may be to replicate and rename the input file (input1.txt, input2.txt, etc.) so that each job reads its own copy. A one-line shell script can do this for you.
 
Try this with a subset of the jobs and see if your problem goes away.
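 
A minimal sketch of the replication step, written here in Python rather than as a shell one-liner. The function name replicate_input and the file names are assumptions for illustration only, not anything Condor provides:

```python
import shutil
from pathlib import Path


def replicate_input(source, count, dest_dir=None):
    """Copy `source` to numbered siblings (input.txt -> input1.txt, input2.txt, ...).

    Returns the list of newly created paths, in order.
    """
    source = Path(source)
    # Default to placing the copies next to the original file.
    dest_dir = Path(dest_dir) if dest_dir is not None else source.parent
    dest_dir.mkdir(parents=True, exist_ok=True)

    copies = []
    for i in range(1, count + 1):
        # Insert the index between the base name and the extension.
        target = dest_dir / f"{source.stem}{i}{source.suffix}"
        shutil.copyfile(source, target)
        copies.append(target)
    return copies


# Hypothetical usage: make 10 private copies of input.txt
# replicate_input("input.txt", 10)
```

Each job would then be pointed at its own copy instead of the shared file, which should show whether concurrent access to a single file is involved.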
 
I agree that this may be less a Condor problem than an OS or file system problem.
 
I hope this helps,
 
Jerry Perez
Texas Tech University

________________________________

From: condor-users-bounces@xxxxxxxxxxx on behalf of David A. Kotz
Sent: Fri 11/10/2006 3:34 PM
To: Condor-Users Mail List
Subject: Re: [Condor-users] input file locking?



As I said, I've already checked for accessibility of the file from the
execute node, and I've checked daemon logs for any signs of NFS trouble.
  The same executable is being used for all runs, successful and
unsuccessful, and the same input file has been used both successfully
and unsuccessfully.  The submit description queues up about 300 runs of
the same program, which is doing some evolutionary simulations.

- dave


Junjun Mao wrote:
> Most likely this is not condor related, as the job was already started
> by Condor. Try to run the program on the node with the failure to see if
> it gets the same error. Then you may want to check whether NFS is unstable.
>
> Junjun
>
> On Friday 10 November 2006 16:18, David A. Kotz wrote:
>> When Condor opens an input file for a job, does it lock that file?  I
>> have a user who is submitting hundreds of jobs, all of which refer to
>> a directory (NFS mounted) of text files with one number in each.  At
>> any given time, there may be several jobs using the same input file.
>> Some of the jobs using a given input file run to completion with no
>> problems while others repeatedly fail to run with errors like the
>> following in the shadow log:
>>
>> 11/10 15:10:27 (3744.186) (9335):error: Error: Couldn't open standard
>> file 'inputs/in.186'
>>
>> I've checked the system logs to make sure we aren't having
>> intermittent automounter issues or any other system failings.  The
>> jobs that fail to run keep failing to run, returning to the idle
>> state over and over, even after all of the running jobs have
>> completed.
>> _______________________________________________
>> Condor-users mailing list
>> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx
>> with a subject: Unsubscribe
>> You can also unsubscribe by visiting
>> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>>
>> The archives can be found at either
>> https://lists.cs.wisc.edu/archive/condor-users/
>> http://www.opencondor.org/spaces/viewmailarchive.action?key=CONDOR
>
