
Re: [HTCondor-users] Caching large executable on worker nodes



Hi Jens,

I know that the Open Science Grid's version of HTCondor has the capability I described.
I don't know if it is readily available right now.

You're correct about the local network bandwidth if the caching is only taking place at the HTTP proxy machine.
I do not know if there's a switch to extend it to the individual compute nodes.
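For context, the proxy-level caching in question relies on the job pulling its input over HTTP, so that a Squid sitting near the execute nodes can serve repeat requests. A submit description using URL-based transfer might look roughly like the sketch below; the host name, file name, and wrapper are placeholders of mine, and the exact HTCondor version that supports URL transfer for large inputs should be checked against the manual:

```text
# Hypothetical submit description: the large binary is fetched over
# HTTP (cacheable by a site Squid) rather than pushed from the submit
# machine. run_job.sh would chmod +x the download and exec it.
universe                = vanilla
executable              = run_job.sh
transfer_input_files    = http://web.example.org/myapp
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
arguments               = $(Process)
queue 3000
```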

Perhaps someone on the list could weigh in on these questions.
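On the per-node question, a node-local cache is something one could prototype outside of HTCondor with a small wrapper. The sketch below is purely hypothetical (the `fetch_cached` helper, cache path, and checksum handling are mine, not an HTCondor or OSG feature): the job runs through a wrapper that keeps a checksum-keyed copy of the binary in a node-local directory, so only the first job on each node pays the transfer cost.

```shell
# Hypothetical node-local cache for a large, reused executable.
# Nothing here is an HTCondor feature; it is a plain shell sketch.
# fetch_cached URL SHA256 -> prints the path of the cached copy.
fetch_cached() {
    url=$1
    sum=$2
    cache=${EXE_CACHE:-/var/tmp/exe-cache}   # node-local directory
    target=$cache/$sum                       # keyed by content hash
    mkdir -p "$cache"
    if [ ! -f "$target" ]; then
        # Cache miss: download, verify, then publish atomically.
        curl -sSf -o "$target.part.$$" "$url" || return 1
        echo "$sum  $target.part.$$" | sha256sum -c --quiet || {
            rm -f "$target.part.$$"; return 1; }
        chmod +x "$target.part.$$"
        mv "$target.part.$$" "$target"
    fi
    printf '%s\n' "$target"
}
```

A job wrapper would then run `exec "$(fetch_cached "$URL" "$SUM")" "$@"`. Cache cleanup, the part Jens flags as hard, still needs a policy, e.g. periodically deleting entries not accessed for some number of days.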

Best regards,

Don

Don Krieger, Ph.D.
Department of Neurological Surgery
University of Pittsburgh
(412)648-9654 Office
(412)521-4431 Cell/Text


> -----Original Message-----
> From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf
> Of Jens Schmaler
> Sent: Wednesday, August 12, 2015 12:12 PM
> To: HTCondor-Users Mail List
> Subject: Re: [HTCondor-users] Caching large executable on worker nodes
> 
> Hi Don,
> 
> thank you very much for this hint, which gave me "SQUID" as an additional
> keyword. According to this talk, it seems that this will actually become part of
> Condor 8.4.0:
> 
> http://research.cs.wisc.edu/htcondor/HTCondorWeek2015/presentations/VuosaloC_FileTransCachingProxy.pdf
> 
> 
> Still, I must admit that I do not fully understand the concept yet. Even with a
> SQUID cache for my cluster, my large executable will still be transferred over
> the network to the execute machine for each job. The SQUID server might take
> the load off the submit machine and ideally would have better network
> bandwidth, but the overall network traffic remains. I do not believe that there
> will be a slim SQUID proxy on each execute machine which caches everything
> locally, right?
> 
> Cheers,
> 
> Jens
> 
> 
> 
> 
> Am 11.08.15 um 22:15 schrieb Krieger, Donald N.:
> > It's my understanding that many OSG HTcondor installations include the
> > SQUID caching mechanism.
> >
> > This works for files which are fetched via http.
> >
> > We have divided the files which go to each job into 1 chunk which is
> > the same for all jobs (~20 Mbytes), a 2nd chunk which is the same for
> > blocks of ~3000 jobs (~100 Mbytes), and a 3rd chunk which is different
> > for each job (~50 Kbytes).  Spot checks show that the caching mechanism
> > "hits" 80-95% of the time.
> >
> > Best regards,
> >
> > Don
> >
> > Don Krieger, Ph.D.
> > Department of Neurological Surgery
> > University of Pittsburgh
> > (412)648-9654 Office
> > (412)521-4431 Cell/Text
> >
> >> -----Original Message-----
> >> From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On
> >> Behalf Of Jens Schmaler
> >> Sent: Tuesday, August 11, 2015 2:39 PM
> >> To: HTCondor-Users Mail List
> >> Subject: [HTCondor-users] Caching large executable on worker nodes
> >>
> >> Hi all,
> >>
> >> we are currently in a situation where transferring the executable to
> >> the execute machine for each job is starting to become a limiting
> >> factor. Our case is the following:
> >>
> >> - large executable (500MB), which is the same for a large number of
> >>   jobs within one cluster (jobs only differ in input arguments)
> >>
> >> - few execute machines, i.e. each execute machine will run many such
> >>   jobs (transferring the executable each time although this would not
> >>   be necessary)
> >>
> >> - we are using the file transfer mechanism, but I believe the problem
> >>   would be similar with a shared file system
> >>
> >> - we would like to keep the current job structure for various reasons,
> >>   i.e. we would rather not combine multiple jobs into one
> >>   longer-running one (I can provide the arguments for this if needed)
> >>
> >> My goal would be to reduce the time and network traffic for
> >> transferring the executable thousands of times.
> >>
> >> A very natural idea would be to cache the executable on each execute
> >> machine, hoping that we can make use of it in case we get another job
> >> of the same cluster. I probably would be able to hack something that
> >> will do the trick, although doing it properly might take quite some
> >> effort (when and how to clean up the cache?, ...)
> >>
> >> On the other hand, this seems like a very common problem, so I was
> >> wondering whether Condor offers some built-in magic to cope with this?
> >> Maybe I am missing something obvious?
> >>
> >> Are there any recommended best practices for my case?
> >>
> >> Thank you very much in advance,
> >>
> >> Jens
> >>
> >> _______________________________________________
> >> HTCondor-users mailing list
> >> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx
> >> with a subject: Unsubscribe
> >> You can also unsubscribe by visiting
> >> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> >>
> >> The archives can be found at:
> >> https://lists.cs.wisc.edu/archive/htcondor-users/