Hi Brian,

yes, we have periodic removes. But in principle these should mostly only act on longer time scales, ~O(days), except for the JobRunCount hedge. The idea behind `JobRunCount > 1` is to avoid automatic reruns of jobs, so as to avoid clashes with the VO factories: if these were to resend a job again after a failure, it would result in two job instances.
But the problem with the missing out/err also affected CLUSTERID.0 jobs, which should be the initial iteration and not fall under `JobRunCount > 1`, right?
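As a sanity check on my own reading of the cleanup config, here is a minimal plain-Python re-implementation of the four remove expressions (this is not real ClassAd evaluation, and the attribute defaults for missing values are my assumption) showing that a first-iteration job is only caught by the rerun hedge once JobRunCount exceeds 1:

```python
# Illustrative sketch: mimic the SYSTEM_PERIODIC_REMOVE expressions
# from 90_21_condor_cleanup.conf against sample job ads (dicts).
# Defaults for absent attributes are assumptions for this sketch.

def should_remove(ad, now):
    remove_held = (ad.get("JobStatus") == 5
                   and now - ad.get("EnteredCurrentStatus", now) > 60 * 60 * 24 * 2)
    remove_multiple_run = ad.get("JobRunCount", 0) > 1
    remove_walltime = ad.get("RemoteWallClockTime", 0) > 4 * 24 * 60 * 60
    remove_older_2w = now - ad.get("QDate", now) > 60 * 60 * 24 * 14
    return remove_held or remove_multiple_run or remove_walltime or remove_older_2w

now = 1614800000  # fixed "CurrentTime" for reproducibility

# A CLUSTERID.0-style first iteration, one hour old, still running:
first_run = {"JobStatus": 2, "JobRunCount": 1, "QDate": now - 3600}
# The same job after an automatic rerun:
rerun = dict(first_run, JobRunCount=2)

print(should_remove(first_run, now))  # False: not matched by any expression
print(should_remove(rerun, now))      # True: caught by JobRunCount > 1
```

So if those CLUSTERID.0 jobs really had JobRunCount == 1, this particular expression should not have removed them.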
Cheers,
 Thomas

> grep -v "#" /etc/condor/config.d/90_21_condor_cleanup.conf
RemoveHeldJobs = ( (JobStatus==5 && (CurrentTime - EnteredCurrentStatus) > 60 * 60 * 24 * 2) )
RemoveMultipleRunJobs = ( JobRunCount > 1 )
RemoveDefaultJobWallTime = ( RemoteWallClockTime > 4 * 24 * 60 * 60 )
RemoveAllJobsOlderThan2Weeks = (( CurrentTime - QDate > 60 * 60 * 24 * 14))
SYSTEM_PERIODIC_REMOVE = $(RemoveHeldJobs) || \
    $(RemoveMultipleRunJobs) || \
    $(RemoveDefaultJobWallTime) || \
    $(RemoveAllJobsOlderThan2Weeks)

On 04/03/2021 17.55, Brian Lin wrote:
Hi Thomas,

Jaime reminded me of another common cause of this issue: the routed job is removed from under the CE, so when the CE tries to transfer files back out to the submitter, it can't find the files it needs. Do you have any periodic removes in your local HTCondor config?

Thanks,
Brian

On 3/4/21 10:07 AM, Thomas Hartmann wrote:

Hi Brian,

unfortunately, I have not found a smoking gun yet :-/ The CE is currently on . SELinux gets disabled by default, and on a quick check of the permissions I did not notice anything suspicious. The files have the correct mapped user, including the CLUSTERID.log. Also the possible open file handles should be sufficient. On the fs side it is ext4, so nothing fancy. And I do not see much I/O wait or the like, which might otherwise point to an underlying issue with the HV.

I noticed several stack dumps on the CE, but as far as I see there has been no overlap between the affected PIDs/IDs and these jobs.

Cheers,
 Thomas

condor-8.9.11-1.el7.x86_64
condor-boinc-7.16.11-1.el7.x86_64
condor-classads-8.9.11-1.el7.x86_64
condor-externals-8.9.11-1.el7.x86_64
condor-procd-8.9.11-1.el7.x86_64
htcondor-ce-4.4.1-3.el7.noarch
htcondor-ce-apel-4.4.1-3.el7.noarch
htcondor-ce-bdii-4.4.1-3.el7.noarch
htcondor-ce-client-4.4.1-3.el7.noarch
htcondor-ce-condor-4.4.1-3.el7.noarch
htcondor-ce-view-4.4.1-3.el7.noarch
python2-condor-8.9.11-1.el7.x86_64
python3-condor-8.9.11-1.el7.x86_64
CentOS Linux release 7.9.2009 (Core) @ 3.10.0-1160.11.1.el7.x86_64

root@grid-htcondorce0: [~] ls -all /var/lib/condor-ce/spool/6446/0/cluster406446.proc0.subproc0
total 80
drwx------ 2 belleprd000 belleprd  4096 Mar  3 06:51 .
drwxr-xr-x 4 condor      condor    4096 Mar  3 06:51 ..
-rw-r--r-- 1 belleprd000 belleprd  1028 Mar  3 10:36 406446.0.log
-rwxr-xr-x 1 belleprd000 belleprd 55919 Mar  3 06:51 DIRAC_nd5lYU_pilotwrapper.py
-rw------- 1 belleprd000 belleprd 10354 Mar  3 06:51 tmpBU9zHQ

> sestatus
SELinux status:                 disabled

> cat /proc/sys/fs/file-max
1552725