
Re: [HTCondor-users] Caching large executable on worker nodes



Hi Ian,

sorry for not getting back to you earlier, and thank you very much for
posting "mirror". In fact, I was hoping to get your solution translated
into PowerShell on short notice, but I have not managed to do so yet. I
will post my experiences with this comparatively simple solution as soon
as I have them.

Best wishes,

Jens

On 13.08.15 at 10:23, Ian Cottam wrote:
> I'm a little surprised that people are not commenting on the "solution" I
> posted.
> To recap, here is the help page
> <http://condor.eps.manchester.ac.uk/examples/getting-data-files-selectively-at-runtime-an-example/>
> and I posted the Bash script "mirror" in an earlier post.
> 
> I mentioned that folk here use it mainly to avoid large amounts of data
> being transferred repeatedly to the same compute node(s), but there is no
> reason not to regard code as data.
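> 
> For anyone who has not dug out that earlier post, the flavour of mirror is
> roughly the following. This is only an illustrative sketch, not the actual
> script; the cache location and the fetch command are placeholders:
> 
>   #!/bin/bash
>   # mirror-style sketch: fetch a large file into a node-local cache only
>   # if it is not already there, then expose it in the job's sandbox.
>   set -eu
>   url="$1"                                    # where the big file lives
>   cache="/tmp/condor_cache/$(basename "$url")"
>   mkdir -p "$(dirname "$cache")"
>   if [ ! -f "$cache" ]; then
>       # only the first job on this node pays the transfer cost
>       tmp="$cache.part.$$"
>       curl -fsS -o "$tmp" "$url" && mv "$tmp" "$cache"
>   fi
>   ln -sf "$cache" "./$(basename "$url")"      # later jobs reuse the copy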
> 
> If you are ignoring this approach because it has an obvious flaw, we would
> like to hear what it is.
> (Windows-only pools would have to re-script mirror in, e.g., PowerShell.)
> Regards
> -Ian
> 
> 
> 
> 
> 
> On 13/08/2015 01:08, "HTCondor-users on behalf of John (TJ) Knoeller"
> <htcondor-users-bounces@xxxxxxxxxxx on behalf of johnkn@xxxxxxxxxxx> wrote:
> 
>> There's not yet a good mechanism for this.  Various people are working
>> on good solutions; in the meantime, a bit of scriptery could go a long
>> way.
>>
>> You could produce a workable solution for this problem if you are able
>> to break it into a few pieces.
>>
>> 1) Configure the startds with a custom resource that provides one or more
>> places to store cached executables, using MACHINE_RESOURCE_* configuration;
>>     call it, say, MACHINE_RESOURCE_STAGE = /scratch/stage1 /scratch/stage2
>>
>> 2) Pilot jobs grab a stage and transfer a file into it (a sketch of the
>> submit file and of stageit.sh follows after this list).
>>      executable = stageit.sh
>>      transfer_input_files = ffmpeg
>>      Request_Stage = 1
>>
>> 3) A STARTD_CRON job examines what is in the stage directories and
>> publishes that into the machine ads. So if your program is ffmpeg, the
>>     startd cron job would publish
>>     HAS_STAGED_FFMPEG = "/scratch/stage1"
>>    The stage contents would have to be somehow self-describing, and
>> the STARTD_CRON job should probably also expire them (sketch below).
>>
>> 4) Regular jobs require HAS_STAGED_FFMPEG to be defined in order to match:
>>      Requirements = HAS_STAGED_FFMPEG =!= undefined
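>>
>> To make (1) and (2) concrete, here is a minimal sketch. The /scratch
>> paths, the name stageit.sh and the AssignedStage attribute are
>> assumptions on my part - check what your startd really advertises with
>> condor_status -l before relying on them.
>>
>>    # condor_config on the execute machines: two stage directories as a
>>    # custom machine resource named "Stage"
>>    MACHINE_RESOURCE_STAGE = /scratch/stage1 /scratch/stage2
>>
>>    # pilot submit file: request one stage, pass its path to the script
>>    executable              = stageit.sh
>>    arguments               = "$$(AssignedStage) ffmpeg"
>>    transfer_input_files    = ffmpeg
>>    should_transfer_files   = YES
>>    when_to_transfer_output = ON_EXIT
>>    request_stage           = 1
>>    queue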
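>>
>> stageit.sh itself could be something like the following sketch; the copy
>> goes to a temporary name first so a running job never sees a half-copied
>> binary:
>>
>>    #!/bin/bash
>>    # stageit.sh <stage_dir> <file>... : copy files the pilot transferred
>>    # into its sandbox over to the assigned stage directory.
>>    set -eu
>>    stage_dir="$1"; shift
>>    mkdir -p "$stage_dir"
>>    for f in "$@"; do
>>        cp "$f" "$stage_dir/$f.tmp.$$"               # copy under a temp name
>>        chmod a+rx "$stage_dir/$f.tmp.$$"
>>        mv "$stage_dir/$f.tmp.$$" "$stage_dir/$f"    # then rename atomically
>>    done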
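>>
>> For (3), a STARTD_CRON probe just prints ClassAd attribute assignments on
>> stdout and the startd merges them into the machine ad. A sketch, with a
>> made-up job name and a 7-day expiry you would want to tune:
>>
>>    # condor_config
>>    STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) STAGEINFO
>>    STARTD_CRON_STAGEINFO_EXECUTABLE = /usr/local/libexec/stage_probe.sh
>>    STARTD_CRON_STAGEINFO_PERIOD = 5m
>>
>>    #!/bin/bash
>>    # stage_probe.sh: advertise staged executables, expire stale copies
>>    for dir in /scratch/stage1 /scratch/stage2; do
>>        if [ -x "$dir/ffmpeg" ]; then
>>            echo "HAS_STAGED_FFMPEG = \"$dir\""
>>            break
>>        fi
>>    done
>>    # drop copies that nothing has touched for a week
>>    find /scratch/stage1 /scratch/stage2 -maxdepth 1 -type f -atime +7 -delete 2>/dev/null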
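>>
>> And for (4), the regular jobs can both require the attribute and use it
>> via $$() substitution; transfer_executable = false keeps the schedd from
>> shipping the binary again (file names and arguments here are invented):
>>
>>    executable          = $$(HAS_STAGED_FFMPEG)/ffmpeg
>>    transfer_executable = false
>>    arguments           = -i input.avi output.mp4
>>    Requirements        = (HAS_STAGED_FFMPEG =!= UNDEFINED)
>>    queue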
>>
>> You would also need some mechanism to make sure that pilot jobs don't
>> trash the stages or overproduce...
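>>
>> One cheap way to keep pilots from stepping on each other would be a
>> flock(1) lock around the copy in stageit.sh, e.g. (sketch):
>>
>>    exec 9>"$stage_dir/.lock"
>>    flock -n 9 || exit 0     # someone else is already filling this stage
>>    # ... do the copy/rename shown above while holding the lock ...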
>>
>> On 8/11/2015 1:38 PM, Jens Schmaler wrote:
>>> Hi all,
>>>
>>> we are currently in a situation where transferring the executable to the
>>> execute machine for each job starts to become a limiting factor. Our case
>>> is the following:
>>>
>>> - large executable (500MB), which is the same for a large number of jobs
>>> within one cluster (jobs only differ in input arguments)
>>>
>>> - few execute machines, i.e. each execute machine will run many such
>>> jobs (so the executable is transferred each time even though this would
>>> not be necessary)
>>>
>>> - we are using the file transfer mechanism, but I believe the problem
>>> would be similar with a shared file system
>>>
>>> - we would like to keep the current job structure for various reasons,
>>> i.e. we would rather not combine multiple jobs into one longer-running
>>> one (I can provide the arguments for this if needed)
>>>
>>>
>>> My goal would be to reduce the time and network traffic for transferring
>>> the executable thousands of times.
>>>
>>> A very natural idea would be to cache the executable on each execute
>>> machine, hoping that it can be reused when we get another job of the
>>> same cluster. I could probably hack something together that does the
>>> trick, although doing it properly might take quite some effort (when
>>> and how to clean up the cache, ...).
>>>
>>> On the other hand, this seems like a very common problem, so I was
>>> wondering whether Condor offers some built-in magic to cope with it.
>>> Maybe I am missing something obvious?
>>>
>>> Are there any recommended best practices for my case?
>>>
>>> Thank you very much in advance,
>>>
>>> Jens
>>>