
Re: [HTCondor-users] Caching large executable on worker nodes



Disregard my previous. :)

The CacheD was sardonically described as "grad-student-ware" by the author - quite similar to a feasibility study, but with a shiny cap and gown at the end of it if it's done right.

It's certainly a common concern among the entire user community, so I think that the Squid capability is just the first step in this direction. It may well turn out that CacheD or something like it will mature enough to join the main branch at some point in the future.

I suspect the Squid approach wound up in 8.4 because of the scalability work they'd done with CERN and the Open Science Grid. It covers a larger set of current use cases more easily than exec-side caching.

One of the difficulties Miron pointed out was that an exec-node cache may reduce available scratch space, so there are some countervailing interests at work there. Another of the presentations discussed characterizing network bandwidth as just another kind of resource, so that the startd could be more intelligent about how it interacts with the rest of the network, and heavy jobs could flag themselves as such to the negotiator. That was further along the roadmap, though.

One way I dealt with a similar issue was via a prepare-job hook - you give it a list of stuff to symlink into the scratch directory from a shared filesystem, and when your job starts there's a collection of links pointing to the right version of the input data out on the NFS servers. The jobs were short-lived and used only a subset of stuff from the scenario files. Symlinking everything was easier than figuring out exactly what each job needed and transferring just that, and transferring the entire kit would have taken longer than the job runtime in some cases.
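
For anyone who wants to try the same thing, here is a rough sketch of what such a hook could look like. It is not the hook I actually ran. It assumes the starter hands the hook the job ClassAd on standard input and runs it with the scratch directory as the working directory, which is how I understand the prepare-job hook to work, so check the manual for your version. The attribute name SymlinkInputs and the function names are made up for illustration.

```python
import os
import re
import sys

# Hypothetical custom job attribute: a comma-separated list of absolute
# paths on the shared filesystem, e.g. set with +SymlinkInputs = "..."
ATTR = "SymlinkInputs"


def parse_link_list(classad_text, attr=ATTR):
    """Pull the path list out of a line like: SymlinkInputs = "/a,/b"."""
    pattern = r'^\s*' + re.escape(attr) + r'\s*=\s*"([^"]*)"\s*$'
    match = re.search(pattern, classad_text, re.MULTILINE | re.IGNORECASE)
    if not match:
        return []
    return [p.strip() for p in match.group(1).split(",") if p.strip()]


def link_inputs(sources, scratch_dir):
    """Symlink each source into scratch_dir under its basename."""
    created = []
    for src in sources:
        link = os.path.join(scratch_dir, os.path.basename(src.rstrip("/")))
        if os.path.lexists(link):
            continue  # leave anything already in the sandbox alone
        os.symlink(src, link)
        created.append(link)
    return created


def run_hook(stream, scratch_dir):
    """Read the job ad text from stream and create the links it asks for."""
    return link_inputs(parse_link_list(stream.read()), scratch_dir)


# A deployed hook script would finish with something like:
#     run_hook(sys.stdin, os.getcwd())
```

The simple regular expression only handles a plain quoted string attribute. A real hook would want a proper ClassAd parser and some error handling for missing source files.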

In addition, the Linux buffer cache came into play when the various jobs accessed the files. The memory on the systems was large enough and the jobs were small enough that the inputs could all fit in the disk cache, so in most cases each exec node read the files from the NFS server only once.

 

Michael V. Pelletier
IT Program Execution
Principal Engineer
978.858.9681 (5-9681) NOTE NEW NUMBER
339.293.9149 cell
339.645.8614 fax

michael.v.pelletier@xxxxxxxxxxxx



"HTCondor-users" <htcondor-users-bounces@xxxxxxxxxxx> wrote on 08/12/2015 04:00:32 PM:

> From: Jens Schmaler <jens.schmaler@xxxxxx>

> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> Date: 08/12/2015 04:01 PM
> Subject: Re: [HTCondor-users] Caching large executable on worker nodes
> Sent by: "HTCondor-users" <htcondor-users-bounces@xxxxxxxxxxx>
>
> Thanks Michael! I stumbled over these talks as well after Don mentioned
> the use of SQUID. It looks as if Condor has decided to go for the SQUID
> solution from 8.4.0 on, right? But what is then the future of the other
> concepts? Are they just feasibility studies? Or were they competitor
> ideas that ultimately did not make it?
>
> Maybe someone of the experts could comment on this - it might be
> interesting for many people working on this kind of problem.
>
> Thanks in advance,
>
> Jens
>