Hi again, maybe related(?) - I just noticed that a restart of the condor unit caused the Schedd to lose all its jobs. Since the restart was more or less instantaneous, I would have expected the Schedd to pick up its jobs.
Cheers,
  Thomas

03/05/21 10:55:11 (pid:3997828) WARNING - Cluster 437906 was deleted with proc ads still attached to it. This should only happen during schedd shutdown.
Mar 05 10:55:11 grid-htcondorce0.desy.de systemd: Stopping Condor Distributed High-Throughput-Computing...
-- Subject: Unit condor.service has begun shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit condor.service has begun shutting down.
Mar 05 10:55:11 grid-htcondorce0.desy.de systemd: Stopped Condor Distributed High-Throughput-Computing.
-- Subject: Unit condor.service has finished shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit condor.service has finished shutting down.
Mar 05 10:55:11 grid-htcondorce0.desy.de systemd: Starting Condor Distributed High-Throughput-Computing...
-- Subject: Unit condor.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit condor.service has begun starting up.
Mar 05 10:55:11 grid-htcondorce0.desy.de systemd: Started Condor Distributed High-Throughput-Computing.
-- Subject: Unit condor.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel

On 05/03/2021 10.26, Thomas Hartmann wrote:
Hi Brian,

yes, we have periodic removes. But 'in principle' these should mostly only act on longer time scales ~O(days) - except for the JobRunCount hedge. The idea behind `JobRunCount > 1` is to avoid automatic reruns of jobs, so as to avoid clashes with the VO factories: if these were to resend the jobs again on a failure, it would result in two job instances. But the problem with the missing out/err also affected CLUSTERID.0 jobs, which should be the initial iteration and not fall under `JobRunCount > 1`, or?

Cheers,
  Thomas

> grep -v "#" /etc/condor/config.d/90_21_condor_cleanup.conf
RemoveHeldJobs = ( (JobStatus==5 && (CurrentTime - EnteredCurrentStatus) > 60 * 60 * 24 * 2) )
RemoveMultipleRunJobs = ( JobRunCount > 1 )
RemoveDefaultJobWallTime = ( RemoteWallClockTime > 4 * 24 * 60 * 60 )
RemoveAllJobsOlderThan2Weeks = (( CurrentTime - QDate > 60 * 60 * 24 * 14))
SYSTEM_PERIODIC_REMOVE = $(RemoveHeldJobs)              || \
                         $(RemoveMultipleRunJobs)       || \
                         $(RemoveDefaultJobWallTime)    || \
                         $(RemoveAllJobsOlderThan2Weeks)

On 04/03/2021 17.55, Brian Lin wrote:

Hi Thomas,

Jaime reminded me of another common cause of this issue: the routed job is removed from under the CE, so when the CE tries to transfer files back out to the submitter, it can't find the files it needs. Do you have any periodic removes in your local HTCondor config?

Thanks,
Brian

On 3/4/21 10:07 AM, Thomas Hartmann wrote:

Hi Brian,

unfortunately, I have not found a smoking gun yet :-/ The CE is currently on . SELinux gets disabled by default, and on a quick check of the permissions I did not notice anything suspicious. The files got the correct mapped user - including the CLUSTERID.log. Also the possible open file handles should be sufficient. On the fs side it is ext4 - so nothing fancy. And I do not see much I/O wait or so, which might point to an underlying issue with the HV. I noticed several stack dumps on the CE.
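For illustration, here is a rough pure-Python sketch of the combined SYSTEM_PERIODIC_REMOVE logic above - not the real ClassAd evaluator, and the job ads below are made-up examples - just to show that a first-run job (JobRunCount == 1, as a CLUSTERID.0 job on its initial iteration should be) would not trip the JobRunCount hedge:

```python
import time

HELD = 5  # JobStatus value for "Held"

def matches_periodic_remove(job, now=None):
    """Return True if a job ad (plain dict) matches any remove clause
    from the config above. Missing attributes default to 'no match'."""
    now = now or time.time()
    held_too_long = (job.get("JobStatus") == HELD
                     and now - job.get("EnteredCurrentStatus", now) > 60 * 60 * 24 * 2)
    multiple_runs = job.get("JobRunCount", 0) > 1
    over_walltime = job.get("RemoteWallClockTime", 0) > 4 * 24 * 60 * 60
    older_2_weeks = now - job.get("QDate", now) > 60 * 60 * 24 * 14
    return held_too_long or multiple_runs or over_walltime or older_2_weeks

# A running first-attempt job submitted an hour ago survives:
fresh = {"JobStatus": 2, "JobRunCount": 1, "QDate": time.time() - 3600}
# A rerun job trips the JobRunCount hedge immediately:
rerun = {"JobStatus": 2, "JobRunCount": 2, "QDate": time.time() - 3600}
print(matches_periodic_remove(fresh))  # False
print(matches_periodic_remove(rerun))  # True
```

So if the affected jobs really were at JobRunCount == 1, something other than the RemoveMultipleRunJobs clause would have to be removing them.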
But AFAIS there has been no overlap between the affected PIDs/IDs and these jobs.

Cheers,
  Thomas

condor-8.9.11-1.el7.x86_64
condor-boinc-7.16.11-1.el7.x86_64
condor-classads-8.9.11-1.el7.x86_64
condor-externals-8.9.11-1.el7.x86_64
condor-procd-8.9.11-1.el7.x86_64
htcondor-ce-4.4.1-3.el7.noarch
htcondor-ce-apel-4.4.1-3.el7.noarch
htcondor-ce-bdii-4.4.1-3.el7.noarch
htcondor-ce-client-4.4.1-3.el7.noarch
htcondor-ce-condor-4.4.1-3.el7.noarch
htcondor-ce-view-4.4.1-3.el7.noarch
python2-condor-8.9.11-1.el7.x86_64
python3-condor-8.9.11-1.el7.x86_64

CentOS Linux release 7.9.2009 (Core) @ 3.10.0-1160.11.1.el7.x86_64

root@grid-htcondorce0: [~] ls -all /var/lib/condor-ce/spool/6446/0/cluster406446.proc0.subproc0
total 80
drwx------ 2 belleprd000 belleprd  4096 Mar 3 06:51 .
drwxr-xr-x 4 condor      condor    4096 Mar 3 06:51 ..
-rw-r--r-- 1 belleprd000 belleprd  1028 Mar 3 10:36 406446.0.log
-rwxr-xr-x 1 belleprd000 belleprd 55919 Mar 3 06:51 DIRAC_nd5lYU_pilotwrapper.py
-rw------- 1 belleprd000 belleprd 10354 Mar 3 06:51 tmpBU9zHQ

> sestatus
SELinux status:                 disabled

> cat /proc/sys/fs/file-max
1552725

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a subject: Unsubscribe
You can also unsubscribe by visiting https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
The archives can be found at: https://lists.cs.wisc.edu/archive/htcondor-users/