
Re: [HTCondor-users] Caching large executable on worker nodes



There's not yet a good mechanism for this. Various people are working on good solutions; in the meantime, a bit of scriptery could go a long way.

You could produce a workable solution to this problem if you are able to break it into a few pieces.

1) Configure the startds with a custom resource consisting of one or more places to store cached executables, using the MACHINE_RESOURCE_* configuration. Call it STAGE:
    MACHINE_RESOURCE_STAGE = /scratch/stage1 /scratch/stage2
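
If the custom-resource machinery behaves here the way it does for GPUs (an assumption worth checking against your HTCondor version), each slot that is allocated a stage should advertise an Assigned attribute naming it, for example:
    AssignedStage = "/scratch/stage1"
The pilot script in step 2 can read this attribute to find out where to put the file.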

2) Pilot jobs grab a stage and transfer a file into it:
     executable = stageit.sh
     transfer_input_files = ffmpeg
     Request_Stage = 1
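
A minimal sketch of stageit.sh, assuming the slot ad carries AssignedStage as above and that HTCondor has set _CONDOR_MACHINE_AD in the job environment (the path to a file holding the machine ad); the ffmpeg.staged marker name is just a convention I made up:
     #!/bin/sh
     # stageit.sh - copy the transferred executable into the stage
     # directory assigned to this slot
     stage=$(awk -F'"' '/^AssignedStage/ {print $2; exit}' "$_CONDOR_MACHINE_AD")
     [ -n "$stage" ] || { echo "no stage assigned" >&2; exit 1; }
     cp ffmpeg "$stage/ffmpeg" && chmod +x "$stage/ffmpeg"
     # leave a timestamp marker so the STARTD_CRON probe can find
     # (and eventually expire) what was staged
     date +%s > "$stage/ffmpeg.staged"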

3) A STARTD_CRON job examines what is in the stage directories and publishes that into the machine ads. So if your program is ffmpeg, the startd cron job would publish:
    HAS_STAGED_FFMPEG = "/scratch/stage1"
The stage contents would have to be somehow self-describing, and the STARTD_CRON job should probably also expire them.
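
Here is one way such a probe might look; the directory list, the marker-file convention from step 2, and the seven-day expiry are all assumptions to adapt:
     #!/bin/sh
     # stage_probe.sh - STARTD_CRON probe: whatever it prints as
     # attribute = value lines gets merged into the machine ads
     staged=
     for d in /scratch/stage1 /scratch/stage2; do
         marker="$d/ffmpeg.staged"
         # drop a stage whose marker is older than ~7 days
         if [ -f "$marker" ] && [ -n "$(find "$marker" -mtime +7)" ]; then
             rm -f "$d/ffmpeg" "$marker"
         fi
         # remember the first stage that still holds the executable
         if [ -z "$staged" ] && [ -x "$d/ffmpeg" ]; then
             staged=$d
         fi
     done
     [ -n "$staged" ] && echo "HAS_STAGED_FFMPEG = \"$staged\""
     exit 0
Hook it into the startd with something like:
     STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) STAGEPROBE
     STARTD_CRON_STAGEPROBE_EXECUTABLE = /usr/local/libexec/stage_probe.sh
     STARTD_CRON_STAGEPROBE_PERIOD = 5m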

4) Regular jobs require HAS_STAGED_FFMPEG to be defined in order to match:
     Requirements = HAS_STAGED_FFMPEG =!= undefined
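
Putting it together in a regular job's submit file, using $$() substitution to pick up the value of HAS_STAGED_FFMPEG from the matched machine (run_ffmpeg.sh is a hypothetical one-line wrapper that just does exec "$@", and the input/output file names are placeholders):
     executable            = run_ffmpeg.sh
     arguments             = $$(HAS_STAGED_FFMPEG)/ffmpeg -i input.avi output.mp4
     requirements          = HAS_STAGED_FFMPEG =!= undefined
     should_transfer_files = YES
     transfer_input_files  = input.avi
     queue
The wrapper is needed because the job still has to name some executable to transfer; the staged binary itself is only referenced through the expanded path.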

You would also need some mechanism to make sure that pilot jobs don't trash the stages or overproduce...

On 8/11/2015 1:38 PM, Jens Schmaler wrote:
Hi all,

we are currently in a situation where transferring the executable to the
execute machine for each job is starting to become a limiting factor. Our
case is the following:

- large executable (500MB), which is the same for a large number of jobs
within one cluster (jobs only differ in input arguments)

- few execute machines, i.e. each execute machine will run many such
jobs (so the executable is transferred each time, even though this would
not be necessary)

- we are using the file transfer mechanism, but I believe the problem
would be similar with a shared file system

- we would like to keep the current job structure for various reasons,
i.e. we would rather not combine multiple jobs into one longer-running
one (I can provide the arguments for this if needed)


My goal would be to reduce the time and network traffic for transferring
the executable thousands of times.

A very natural idea would be to cache the executable on each execute
machine, hoping that we can make use of it in case we get another job
from the same cluster. I could probably hack something together that
would do the trick, although doing it properly might take quite some
effort (when and how to clean up the cache? ...)

On the other hand, this seems like a very common problem, so I was
wondering whether Condor offers some built-in magic to cope with this?
Maybe I am missing something obvious?

Are there any recommended best practices for my case?

Thank you very much in advance,

Jens
