Re: [HTCondor-users] Caching large executable on worker nodes
- Date: Thu, 13 Aug 2015 12:37:08 +0000
- From: "Krieger, Donald N." <kriegerd@xxxxxxxx>
Sorry for not mentioning your effort and the script you posted.
I was too stretched to go through it before I posted.
I see though that you have implemented a version of the local caching scheme that was discussed.
I wonder how robust it has been in your installation and how robust it would be if deployed more widely.
Do you have any measures for how often and in what manner it fails?
I think your ongoing experience would be helpful.
Don Krieger, Ph.D.
Department of Neurological Surgery
University of Pittsburgh
> -----Original Message-----
> From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf
> Of Ian Cottam
> Sent: Thursday, August 13, 2015 4:24 AM
> To: HTCondor-Users Mail List
> Subject: Re: [HTCondor-users] Caching large executable on worker nodes
> I'm a little surprised that people are not commenting on the "solution" I posted.
> To recap, here is the help page
> and I posted the Bash script "mirror" in an earlier post.
> I mentioned that folk here use it mainly to avoid large amounts of data being
> transferred repeatedly to the same compute node(s), but there is no reason not
> to regard code as data.
> If you are ignoring this approach because it has an obvious flaw, we would like
> to hear what it is.
> (Windows-only pools would have to re-script mirror in e.g. PowerShell.)
> Regards
> On 13/08/2015 01:08, "HTCondor-users on behalf of John (TJ) Knoeller"
> <htcondor-users-bounces@xxxxxxxxxxx on behalf of johnkn@xxxxxxxxxxx> wrote:
> >There's not yet a good mechanism for this. Various people are working
> >on good solutions; in the meantime, a bit of scriptery could go a long
> >way.
> >You could produce a workable solution for this problem if you are able
> >to break it into two pieces.
> >1) Configure the startd's with a custom resource that is 1 or more
> >places to store cached executables using MACHINE_RESOURCE_*
> > call it MACHINE_RESOURCE_STAGE = /scratch/stage1 /scratch/stage2
> >2) Pilot jobs grab a stage and transfer a file into it.
> > executable = stageit.sh
> > transfer_input_files = ffmpeg
> > Request_Stage = 1
> >3) A STARTD_CRON job examines what is in the stage directories and
> >publishes that into the machine ads. So if your program is FFMPEG,
> > startd cron would publish:
> > HAS_STAGED_FFMPEG = "/scratch/stage1"
> > the stage contents would have to be somehow self-describing - and
> >the STARTD_CRON job should probably also expire them.
> >4) Regular jobs require HAS_STAGED_FFMPEG to be defined in order to match.
> > Requirements = HAS_STAGED_FFMPEG =!= undefined
> >You would also need some mechanism to make sure that pilot jobs don't
> >trash the stages or overproduce...
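A minimal sketch of steps 2 and 3 of the scheme above, in Bash. It is an illustration under stated assumptions, not something posted in this thread: the function names `stage_file` and `publish_stages`, the `<name>.staged` marker-file convention used to make a stage self-describing, and the age-based expiry are all hypothetical. How a pilot job learns which stage directory it was assigned is left out; the directory is simply passed as an argument.

```shell
#!/bin/bash
# Hypothetical helpers; names and the ".staged" marker convention are
# assumptions made for this sketch, not part of HTCondor.

# Pilot side: copy a file into a stage directory, then mark it as complete.
stage_file() {
    local stage_dir=$1 src=$2
    local name
    name=$(basename "$src")
    mkdir -p "$stage_dir" || return 1
    # Copy under a temporary name and rename afterwards, so a half-copied
    # file never appears under its final name.
    cp "$src" "$stage_dir/.$name.tmp.$$" || return 1
    mv "$stage_dir/.$name.tmp.$$" "$stage_dir/$name" || return 1
    # The marker file is what makes the stage self-describing.
    touch "$stage_dir/$name.staged"
}

# Cron side: print one ClassAd attribute per staged program and expire
# entries whose marker is older than roughly max_age_days days.
publish_stages() {
    local max_age_days=$1
    shift
    local dir marker name
    for dir in "$@"; do
        [ -d "$dir" ] || continue
        for marker in "$dir"/*.staged; do
            [ -e "$marker" ] || continue
            if [ -n "$(find "$marker" -mtime +"$max_age_days")" ]; then
                # Stale entry: remove the marker and the staged file itself.
                rm -f "$marker" "${marker%.staged}"
                continue
            fi
            # Upper-case the name and replace characters that are not
            # valid in an attribute name with underscores.
            name=$(basename "$marker" .staged \
                   | tr '[:lower:]' '[:upper:]' \
                   | tr -c 'A-Z0-9_\n' '_')
            printf 'HAS_STAGED_%s = "%s"\n' "$name" "$dir"
        done
    done
}
```

For example, after `stage_file /scratch/stage1 ffmpeg`, running `publish_stages 7 /scratch/stage1 /scratch/stage2` would print `HAS_STAGED_FFMPEG = "/scratch/stage1"`, which is the attribute regular jobs test in their Requirements expression. Protecting the stages against competing or excessive pilot jobs is not addressed here.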
> >On 8/11/2015 1:38 PM, Jens Schmaler wrote:
> >> Hi all,
> >> we are currently in a situation where transferring the executable to
> >> the execute machine for each job is becoming a limiting factor. Our
> >> case is the following:
> >> - large executable (500MB), which is the same for a large number of
> >> jobs within one cluster (jobs only differ in input arguments)
> >> - few execute machines, i.e. each execute machine will run many such
> >> jobs (transferring the executable each time although this would not
> >> be necessary)
> >> - we are using the file transfer mechanism, but I believe the problem
> >> would be similar with a shared file system
> >> - we would like to keep the current job structure for various
> >> reasons, i.e. we would rather not combine multiple jobs into one
> >> longer-running one (I can provide the arguments for this if needed)
> >> My goal would be to reduce the time and network traffic for
> >> transferring the executable thousands of times.
> >> A very natural idea would be to cache the executable on each execute
> >> machine, hoping that we can make use of it in case we get another job
> >> of the same cluster. I probably would be able to hack something that
> >> will do the trick, although doing it properly might take quite some
> >> effort (when and how to clean up the cache?, ...)
> >> On the other hand, this seems like a very common problem, so I was
> >> wondering whether Condor offers some built-in magic to cope with this?
> >> Maybe I am missing something obvious?
> >> Are there any recommended best practices for my case?
> >> Thank you very much in advance,
> >> Jens
> >> _______________________________________________
> >> HTCondor-users mailing list
> >> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx
> >>with a
> >> subject: Unsubscribe
> >> You can also unsubscribe by visiting
> >> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> >> The archives can be found at:
> >> https://lists.cs.wisc.edu/archive/htcondor-users/
> Ian Cottam | IT Relationship Manager | IT Services | C38 Sackville Street
> Building | The University of Manchester | M13 9PL |
> +44(0)161 306 1851