
Re: [HTCondor-users] Caching large executable on worker nodes



I'm a little surprised that people are not commenting on the "solution" I
posted.
To recap, here is the help page
<http://condor.eps.manchester.ac.uk/examples/getting-data-files-selectively-at-runtime-an-example/>
and I posted the Bash script "mirror" in an earlier post.

I mentioned that folk here use it mainly to avoid large amounts of data
being transferred repeatedly to the same compute node(s), but there is no
reason not to regard code as data.

If you are ignoring this approach because it has an obvious flaw, we would
like to hear what it is.
(Windows-only pools would have to re-script mirror in, e.g., PowerShell.)
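
For anyone who missed the earlier post, the core of the idea fits in a few
lines of shell. The sketch below is not the mirror script itself, just an
illustration of the pattern; the cache directory, lock file and fetch
command are placeholders:

  #!/bin/bash
  # Illustration only: fetch a large file once per execute node and reuse it.
  CACHE=/scratch/condor_cache        # node-local cache directory (placeholder)
  URL="$1"                           # where the data or executable lives
  NAME=$(basename "$URL")

  mkdir -p "$CACHE"
  # take a per-file lock so concurrent jobs on the node fetch it only once
  flock "$CACHE/$NAME.lock" -c \
      "[ -s '$CACHE/$NAME' ] || wget -q -O '$CACHE/$NAME' '$URL'" || exit 1
  # expose the cached copy in the job's working directory
  ln -s "$CACHE/$NAME" "./$NAME"
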
Regards
-Ian





On 13/08/2015 01:08, "HTCondor-users on behalf of John (TJ) Knoeller"
<htcondor-users-bounces@xxxxxxxxxxx on behalf of johnkn@xxxxxxxxxxx> wrote:

>There's not yet a good mechanism for this.  Various people are working
>on good solutions; in the meantime, a bit of scriptery could go a long
>way.
>
>You could produce a workable solution for this problem if you are able
>to break it into two pieces.
>
>1) Configure the startds with a custom resource naming one or more
>places to store cached executables, using MACHINE_RESOURCE_* configuration.
>     Call it, say, MACHINE_RESOURCE_STAGE = /scratch/stage1 /scratch/stage2
>
>2) Pilot jobs grab a stage and transfer a file into it.
>      executable = stageit.sh
>      transfer_input_files = ffmpeg
>      Request_Stage = 1
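>
>     (stageit.sh could be little more than the sketch below; the stage
>     path and the manifest file are placeholders, and in practice the
>     startd would need to tell the pilot which stage it was given)
>
>      #!/bin/bash
>      # copy the transferred binary into the node-local stage directory
>      STAGE=${1:-/scratch/stage1}   # placeholder; pass the real stage in
>      cp ffmpeg "$STAGE/ffmpeg" && chmod +x "$STAGE/ffmpeg"
>      # leave a marker so a STARTD_CRON probe can tell what lives here
>      echo "ffmpeg $(date +%s)" > "$STAGE/.manifest"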
>
>3) A STARTD_CRON job examines what is in the stage directories and
>publishes that into the machine ads. So if your program is FFMPEG,
>     startd cron would publish:
>     HAS_STAGED_FFMPEG = "/scratch/stage1"
>    The stage contents would have to be somehow self-describing, and
>the STARTD_CRON job should probably also expire them.
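>
>     Wiring the probe in uses the usual startd cron knobs, roughly (the
>     job name, path and period below are only illustrative):
>
>      STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) STAGE_PROBE
>      STARTD_CRON_STAGE_PROBE_EXECUTABLE = /usr/local/libexec/stage_probe.sh
>      STARTD_CRON_STAGE_PROBE_PERIOD = 10m
>
>     and stage_probe.sh just prints ClassAd attributes to stdout:
>
>      #!/bin/bash
>      # advertise any stage directory that holds a usable ffmpeg
>      for d in /scratch/stage1 /scratch/stage2; do
>          [ -x "$d/ffmpeg" ] && echo "HAS_STAGED_FFMPEG = \"$d\""
>      done
>      exit 0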
>
>4) Regular jobs require HAS_STAGED_FFMPEG to be defined in order to match.
>      Requirements = HAS_STAGED_FFMPEG =!= undefined
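>
>     (a regular job could then pick the binary up from the advertised
>     path via $$() expansion instead of transferring it, something like
>
>      executable = $$(HAS_STAGED_FFMPEG)/ffmpeg
>      transfer_executable = false
>
>     on top of the Requirements line above; untested, but that is the
>     general shape)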
>
>You would also need some mechanism to make sure that pilot jobs don't
>trash the stages or overproduce...
>
>On 8/11/2015 1:38 PM, Jens Schmaler wrote:
>> Hi all,
>>
>> we are currently in a situation where transferring the executable to the
>> execute machine for each job is starting to become a limiting factor. Our
>> case is the following:
>>
>> - large executable (500MB), which is the same for a large number of jobs
>> within one cluster (jobs only differ in input arguments)
>>
>> - few execute machines, i.e. each execute machine will run many such
>> jobs (so the executable is transferred each time, although this would
>> not be necessary)
>>
>> - we are using the file transfer mechanism, but I believe the problem
>> would be similar with a shared file system
>>
>> - we would like to keep the current job structure for various reasons,
>> i.e. we would rather not combine multiple jobs into one longer-running
>> one (I can provide the arguments for this if needed)
>>
>>
>> My goal would be to reduce the time and network traffic for transferring
>> the executable thousands of times.
>>
>> A very natural idea would be to cache the executable on each execute
>> machine, hoping that we can make use of it in case we get another job
>> from the same cluster. I could probably hack something together that
>> will do the trick, although doing it properly might take quite some
>> effort (when and how to clean up the cache, etc.).
>>
>> On the other hand, this seems like a very common problem, so I was
>> wondering whether Condor offers some built-in magic to cope with this?
>> Maybe I am missing something obvious?
>>
>> Are there any recommended best practices for my case?
>>
>> Thank you very much in advance,
>>
>> Jens
>>
>


-- 
Ian Cottam  | IT Relationship Manager | IT Services  | C38 Sackville
Street Building  |  The University of Manchester  |  M13 9PL  |
+44(0)161 306 1851