[HTCondor-users] HTCondor-CE spool removes bl_home directories



    Dear all,

    We've upgraded an HTCondor-CE from 3.2.1 (UMD4 repositories) to 3.4.2 (https://research.cs.wisc.edu/htcondor/yum/stable/8.8/rhel7).
    The update was done as we wanted to benefit from the improved APEL integration.
    The setup is an HTCondor-CE that submits to a SLURM instance.

    After the update, grid jobs are failing; it looks like some cleanup is happening too early.
    What we can see is that the spool directory is created, e.g.
    /users/condor/spool/8429/0/cluster8429.proc0.subproc0

    Jobs are correctly routed (as previously) to the slurm instance, and slurm jobs are started.
    The grid pilot jobs start executing. Between a few seconds and about 2 minutes later, the bl_home directories inside the spool directory get removed,
    e.g. from the Condor stderr of the job:
    _condor_stderr:mkdir: cannot create directory ‘/users/condor/spool/8429/0/cluster8429.proc0.subproc0/home_bl_7e83452f3d9a/.alien’: No such file or directory
    The "home_bl_7e83452f3d9a" subdirectory of the grid job has been removed. We see the same pattern for all jobs from multiple grid VOs.

    I've also enabled debugging in the condor config:
    ALL_DEBUG = D_ALWAYS:2 D_CAT D_SECURITY
    But I still have not been able to find out what's going wrong.


    Tbh, I'm not sure this is the right place for this kind of question, but any help or pointers would be really appreciated.
    I'm quite new to HTCondor-CE, so at this point I'm not even sure what to look for in the logs.

    Best,
    Erich