Re: [HTCondor-users] Caching large executable on worker nodes
- Date: Wed, 12 Aug 2015 16:31:39 -0400
- From: Michael V Pelletier <Michael.V.Pelletier@xxxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Caching large executable on worker nodes
Disregard my previous. :)

The CacheD was sardonically described as "grad-student-ware" by the author - quite similar to a feasibility study, but with a shiny cap and gown at the end of it if it's done right.

Caching is certainly a common concern across the entire user community, so I think that the Squid capability is just the first step in this direction. It may well turn out that CacheD or something like it will mature enough to join the main branch at some point in the future.

I suspect the Squid approach wound up in 8.4 because of the scalability work they'd done with CERN and the Open Science Grid. It covers a larger set of current use cases more easily than exec-side caching.

One of the difficulties Miron pointed out was that an exec-node cache may reduce available scratch space, so there are some countervailing interests at work there. Another of the presentations discussed characterizing network bandwidth as just another kind of resource, so that the startd could be more intelligent about how it interacts with the rest of the network, and heavy jobs could alert the negotiator to that fact. That was further along the roadmap, though.

One way I dealt with a similar issue was via a prepare-job hook - you give it a list of stuff to symlink into the scratch directory from a shared filesystem, and when your job starts there's a collection of links pointing to the right version of the input data out on the NFS servers. The jobs were short-lived and used only a subset of the scenario files. Symlinking everything was easier than figuring out exactly what each job needed and transferring just that, and transferring the entire kit would have taken longer than the job runtime in some cases.
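
A minimal Python sketch of the symlinking step in a hook like that, for illustration only. This is not the hook described above: the attribute name SymlinkInputs, the comma-separated list format, and both function names are hypothetical, and it assumes the job ad reaches the hook on standard input as one Name = "value" line per attribute.

```python
import os
import re

# Hypothetical custom job attribute; the original post does not name one.
LINK_ATTR = "SymlinkInputs"


def parse_link_list(ad_text, attr=LINK_ATTR):
    """Return the comma-separated paths stored in `attr` of a job ad.

    Assumes the ad text holds one 'Name = "value"' line per attribute.
    Attribute names are matched case-insensitively.
    """
    pattern = re.compile(
        r'^\s*' + re.escape(attr) + r'\s*=\s*"(.*)"\s*$', re.IGNORECASE
    )
    for line in ad_text.splitlines():
        match = pattern.match(line)
        if match:
            return [p.strip() for p in match.group(1).split(",") if p.strip()]
    return []


def link_inputs(sources, scratch_dir):
    """Symlink each source path into scratch_dir under its base name.

    Existing entries in scratch_dir are left alone rather than replaced.
    Returns the link paths, one per source.
    """
    links = []
    for src in sources:
        dest = os.path.join(scratch_dir, os.path.basename(src.rstrip("/")))
        if not os.path.lexists(dest):
            os.symlink(src, dest)
        links.append(dest)
    return links
```

A hook script built on this would presumably call parse_link_list(sys.stdin.read()), hand the result to link_inputs along with the job's scratch directory, and exit 0. How the hook learns the scratch directory depends on the HTCondor version and configuration, so check the job hook documentation rather than trusting this sketch.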

In addition, the Linux buffer cache came into play for access to the files by the various jobs. The memory on the systems was large enough and the jobs were small enough that the inputs could all fit in the disk cache, so in most cases each exec node only had to pull the files from the NFS server once.

Michael V. Pelletier
IT Program Execution
Principal Engineer
978.858.9681 (5-9681) NOTE NEW NUMBER
339.293.9149 cell
339.645.8614 fax
michael.v.pelletier@xxxxxxxxxxxx

"HTCondor-users" <htcondor-users-bounces@xxxxxxxxxxx>
wrote on 08/12/2015 04:00:32 PM:
> From: Jens Schmaler <jens.schmaler@xxxxxx>
> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> Date: 08/12/2015 04:01 PM
> Subject: Re: [HTCondor-users] Caching large executable on worker nodes
> Sent by: "HTCondor-users" <htcondor-users-bounces@xxxxxxxxxxx>
>
> Thanks Michael! I stumbled over these talks as well after Don mentioned
> the use of SQUID. It looks as if Condor has decided to go for the SQUID
> solution from 8.4.0 on, right? But what is then the future of the other
> concepts? Are they just feasibility studies? Or were they competitor
> ideas that ultimately did not make it?
>
> Maybe one of the experts could comment on this - it might be
> interesting for many people working on this kind of problem.
>
> Thanks in advance,
>
> Jens
>