Re: [HTCondor-users] Caching large executable on worker nodes
- Date: Wed, 12 Aug 2015 19:16:51 +0000
- From: "Krieger, Donald N." <kriegerd@xxxxxxxx>
- Subject: Re: [HTCondor-users] Caching large executable on worker nodes
You're right of course, Dimitri.
I think though that it doesn't have to be bullet-proof.
All that is required to reduce network bandwidth is (1) that it works most of the time and (2) that each job instance has a way to recognize a "failure" and go ahead on its own. The algorithm for recognizing a failure could start off as simple as a timeout, after which a job either initiates its own download or, after repeated tries, dies. For that to work, the file must not be locked, so that any job can write into it the time at which it starts a download. The inherent race can be managed by having each job check the last time written into the file at random intervals that are long enough to make a collision very unlikely, e.g. 10-20 sec. I routinely use date +%N and usleep to do stuff like this.
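
A minimal Python sketch of the timestamp idea above, not a tested production recipe. Everything named here is an assumption made for illustration: the stamp-file layout, the 600-second timeout, the `fetch` callable that performs the actual download, and the rename of a per-job temporary file into place (added so that a half-written download is never mistaken for the finished file).

```python
import os
import random
import time
from pathlib import Path

STALE_AFTER = 600.0  # assumed timeout (seconds) before a download counts as failed


def read_start_time(stamp: Path):
    """Return the download start time recorded in the stamp file, or None."""
    try:
        return float(stamp.read_text().strip())
    except (FileNotFoundError, ValueError):
        return None


def write_start_time(stamp: Path, now: float) -> None:
    """Record when a download started. The file is not locked; anyone may write."""
    stamp.write_text(f"{now:.6f}\n")


def decide(stamp: Path, target: Path, now: float, stale_after: float = STALE_AFTER) -> str:
    """Return 'use' if the file is there, 'wait' if a recent download is
    recorded, or 'fetch' if nobody started one or the last one timed out."""
    if target.exists():
        return "use"
    started = read_start_time(stamp)
    if started is not None and now - started < stale_after:
        return "wait"
    return "fetch"


def obtain(target: Path, stamp: Path, fetch, max_tries: int = 3,
           poll=(10.0, 20.0), stale_after: float = STALE_AFTER,
           sleep=time.sleep, clock=time.time) -> Path:
    """Wait for someone else's download, or do our own; give up after max_tries."""
    for _ in range(max_tries):
        while True:
            action = decide(stamp, target, clock(), stale_after)
            if action == "use":
                return target
            if action == "fetch":
                break
            # Random 10-20 s polling interval keeps two jobs from colliding often.
            sleep(random.uniform(*poll))
        write_start_time(stamp, clock())
        tmp = Path(f"{target}.{os.getpid()}.part")  # per-job temporary name
        try:
            fetch(tmp)                # hypothetical callable: downloads to tmp
            os.replace(tmp, target)   # atomic rename on the same filesystem
            return target
        except OSError:
            # Our own download failed: clean up and clear the stamp so the
            # next try (ours or another job's) does not wait on it.
            tmp.unlink(missing_ok=True)
            stamp.unlink(missing_ok=True)
    raise RuntimeError(f"could not obtain {target} after {max_tries} tries")
```

Two jobs can still both decide to fetch within the same instant; in that case each downloads its own copy and the second rename overwrites the first, which costs bandwidth but not correctness.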
Don Krieger, Ph.D.
Department of Neurological Surgery
University of Pittsburgh
> -----Original Message-----
> From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf
> Of Dimitri Maziuk
> Sent: Wednesday, August 12, 2015 1:19 PM
> To: htcondor-users@xxxxxxxxxxx
> Subject: Re: [HTCondor-users] Caching large executable on worker nodes
> On 08/12/2015 11:57 AM, Krieger, Donald N. wrote:
> > If it doesn't exist, then fetch it.
> If it doesn't exist,
> 1. look for a lock file.
> 2. If it exists, read the PID from the lock file and check if that exists.
> 3. If it does, sleep for a while and goto 1.
> 4. If it doesn't, create the lock file, start the fetch, write the pid of your
> curl/wget to the lock file.
> 5. Wait for transfer to finish.
> 6. If the file does exist, it doesn't mean the transfer's finished or was
> complete/successful. So you want to checksum it and so on.
> So yes, it's pretty simple until it breaks. And then it gets complicated fast.
> Dimitri Maziuk
> BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
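
For comparison, the lock-file steps quoted above could be sketched roughly as follows. This is an illustration under stated assumptions, not Dimitri's implementation: it assumes a POSIX system and a node-local disk (so PIDs in the lock file are meaningful), it writes the PID of the Python process rather than that of a curl/wget child, and `fetch`, `expected_sha256`, and `max_fetches` are hypothetical names introduced here.

```python
import hashlib
import os
import time
from pathlib import Path


def pid_alive(pid: int) -> bool:
    """POSIX check: signal 0 tests for existence without sending a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def read_pid(lock: Path):
    """Return the PID stored in the lock file, or None if missing or unreadable."""
    try:
        return int(lock.read_text().strip())
    except (FileNotFoundError, ValueError):
        return None


def try_lock(lock: Path) -> bool:
    """Create the lock file exclusively and write our PID; False if it exists."""
    try:
        fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))
    return True


def sha256_of(path: Path) -> str:
    """Checksum the file in 1 MiB chunks so large executables fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def ensure_cached(target: Path, lock: Path, expected_sha256: str, fetch,
                  wait: float = 5.0, max_fetches: int = 3,
                  sleep=time.sleep) -> Path:
    """Return target once it exists with the expected checksum."""
    fetches = 0
    while True:
        # Step 6: existence alone proves nothing, so verify the checksum.
        if target.exists() and sha256_of(target) == expected_sha256:
            return target
        holder = read_pid(lock)
        if holder is not None and pid_alive(holder):
            sleep(wait)  # steps 2-3: a live process is fetching; wait
            continue
        if fetches >= max_fetches:
            raise RuntimeError(f"{target} still invalid after {fetches} fetches")
        # No lock, or a stale or unreadable one. Removing it is itself racy:
        # a lock created but not yet filled in can be deleted here.
        lock.unlink(missing_ok=True)
        if not try_lock(lock):
            continue  # someone else won the race; re-check from the top
        try:
            fetch(target)  # hypothetical callable: downloads to target
            fetches += 1
        finally:
            lock.unlink(missing_ok=True)
```

The comment on the stale-lock removal marks one place where this "gets complicated fast": closing that gap needs either an atomic way to publish the PID together with the lock or a tolerance for occasional duplicate downloads.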