
Re: [HTCondor-users] Cannot use singularity cache directory



On 4/7/23 04:16, Florent Couzinié-Devy wrote:
I set up a local Singularity registry for our small cluster and wanted to take advantage of the image caching built into Singularity. We often use images larger than 1 GB, and transferring the image to the execute node for each job seems wasteful (and is slow in our case). Apptainer/Singularity has a nice feature: when you load images from a registry, it checks the hash to see whether the image already exists in the cache folder and only downloads it if it is not available locally.
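The caching behaviour described above is content-addressed lookup: the image digest is the cache key, so an image already on disk is reused and only a miss triggers a download. Below is a minimal sketch of that idea, assuming a flat `cache_dir/<digest>` layout. The function name `cached_image` and the `fetch` callback are hypothetical, and the layout is not Apptainer's actual on-disk cache format.

```cpp
#include <cassert>
#include <filesystem>
#include <functional>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper (not Apptainer's API): returns the local path of an
// image identified by its content digest, calling `fetch` only when no
// cached copy exists yet. `fetch` stands in for the registry download and
// must create the image at the path it is given.
fs::path cached_image(const fs::path& cache_dir,
                      const std::string& digest,
                      const std::function<void(const fs::path&)>& fetch) {
    const fs::path target = cache_dir / digest;
    if (fs::exists(target)) {
        return target;  // cache hit: reuse the local copy, no transfer
    }
    fs::create_directories(cache_dir);
    fetch(target);      // cache miss: download exactly once
    return target;
}
```

Because the key is a digest and not a tag, a changed image gets a new key and is fetched again, while an unchanged one is never re-downloaded. This only pays off if `cache_dir` outlives the job, which is exactly what the per-job execute directory does not do.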

The documentation [1] suggests that the Singularity cache directory is configurable, with the sentence "There you will find parameters to customize things such as [...], cache directory," but it actually is not. The environment variable used to configure it, "APPTAINER_CACHEDIR", is overwritten by HTCondor and set to the "execute_dir", which is temporary and not suited to caching. I checked the code, and this is done in `condor_starter-V6.1/singularity.cpp:l403-411`. The comments suggest this setting is meant for images rebuilt from Docker images (which might not be cached; I don't have much experience with the "docker://" handle). It would be nice if APPTAINER_CACHEDIR were only set to "execute_dir" when it is not already defined by the user or the HTCondor configuration.
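The proposal amounts to "set a default only if unset". Here is a minimal sketch of that rule, assuming the job environment is modelled as a plain `std::map`. The `Env` alias and the `default_cache_dir` function are hypothetical and do not reflect HTCondor's own environment class or the actual code in `singularity.cpp`.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical model of a job environment: variable name -> value.
using Env = std::map<std::string, std::string>;

// Sketch of the proposed behaviour: default APPTAINER_CACHEDIR to the
// per-job execute directory only when the user or the configuration has
// not already set it.
void default_cache_dir(Env& env, const std::string& execute_dir) {
    // std::map::emplace inserts nothing if the key already exists,
    // so a value supplied by the user or admin is left untouched.
    env.emplace("APPTAINER_CACHEDIR", execute_dir);
}
```

With this rule, a job (or an admin-provided environment) that sets `APPTAINER_CACHEDIR=/scratch/apptainer-cache` keeps that persistent location, and every other job still gets the scratch directory that HTCondor cleans up.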


Hi Florent:

I'm sorry you are having problems with this. There are a couple of complications we should keep in mind. One of the reasons HTCondor sets APPTAINER_CACHEDIR to the execute directory is that we can then guarantee that we clean up and remove the cached files. That may be sooner than you'd like in this case, but otherwise we'd keep them around forever and fill up the disk. Another complication is that, by default, Apptainer puts the cache under the home directory. But in some HTCondor setups, we run with "slot users", and several different submitting users may share the same Unix uid and home directory on the worker node. In many cases, admins set up the home directories so that they are not writable by the slot user.


For now, I am thinking of "solving" this problem with a DAG: a first job that pulls the image only if necessary, and a second job that uses the local image. But this complicates the workflow unnecessarily. Does anyone see a better way to reuse a Singularity image by storing it on the local storage of an execute node?


When would you want the local copy to be removed? Do you have a small, fixed number of images you are interested in running?


greg