
[HTCondor-users] HTCondor-CE spool removes bl_home directories

    Dear all,

    We've upgraded an HTCondor-CE from 3.2.1 (UMD4 repositories) to 3.4.2 (https://research.cs.wisc.edu/htcondor/yum/stable/8.8/rhel7 ).
    The update was done as we wanted to benefit from the improved APEL integration.
    The setup is an HTCondor-CE that submits to a SLURM instance.

    After the update, grid jobs are failing; it looks like some cleanup is happening too early.
    What we can see is that the spool directory is created, i.e.

    Jobs are correctly routed (as previously) to the SLURM instance, and SLURM jobs are started.
    The grid pilot jobs start executing. Between a few seconds and about 2 minutes later, the bl_home directories in the spool directory get removed,
    e.g. from the Condor stderr of the job:
    _condor_stderr:mkdir: cannot create directory ‘/users/condor/spool/8429/0/cluster8429.proc0.subproc0/home_bl_7e83452f3d9a/.alien’: No such file or directory
    The "home_bl_7e83452f3d9a" subdirectory of the grid job has been removed. We see the same pattern for all jobs, across multiple grid VOs.

    I've also enabled debugging in the condor config:
    But I still have not been able to find out what's going wrong.

    Tbh, I'm not sure this is the right place for that kind of question, but any help / pointers are really appreciated.
    I'm quite new to HTCondor-CE, so at this point I'm not even sure what to look for in the logs.