
Re: [HTCondor-users] CondorCE: job submission to Condor-LRMS fails due to stdout/stderr files missing during staging(?)



Hi again,

maybe related(??): I just noticed that a restart of the condor unit caused the Schedd to lose all its jobs [1]. Since the restart [2] was more or less instantaneous, I would have expected the Schedd to pick up its jobs again.

Cheers,
  Thomas

[1]
03/05/21 10:55:11 (pid:3997828) WARNING - Cluster 437906 was deleted with proc ads still attached to it. This should only happen during schedd shutdown.


[2]
Mar 05 10:55:11 grid-htcondorce0.desy.de systemd[1]: Stopping Condor Distributed High-Throughput-Computing...
-- Subject: Unit condor.service has begun shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit condor.service has begun shutting down.
Mar 05 10:55:11 grid-htcondorce0.desy.de systemd[1]: Stopped Condor Distributed High-Throughput-Computing.
-- Subject: Unit condor.service has finished shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit condor.service has finished shutting down.
Mar 05 10:55:11 grid-htcondorce0.desy.de systemd[1]: Starting Condor Distributed High-Throughput-Computing...
-- Subject: Unit condor.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit condor.service has begun starting up.
Mar 05 10:55:11 grid-htcondorce0.desy.de systemd[1]: Started Condor Distributed High-Throughput-Computing.
-- Subject: Unit condor.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
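
(As a rough cross-check - only a sketch, assuming the stock config knobs and default paths - one can verify where the local schedd persists its queue and whether the queue survives a restart:)

 > # location of the schedd's persistent job queue (defaults to $(SPOOL)/job_queue.log)
 > condor_config_val -v JOB_QUEUE_LOG SPOOL
 > # snapshot the local schedd's queue before/after the restart and compare
 > condor_q -allusers -af ClusterId ProcId JobStatus | tee /tmp/queue.before
 > systemctl restart condor
 > condor_q -allusers -af ClusterId ProcId JobStatus | tee /tmp/queue.after
 > diff /tmp/queue.before /tmp/queue.after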



On 05/03/2021 10.26, Thomas Hartmann wrote:
Hi Brian,

yes, we have periodic removes [1]. But 'in principle' these should mostly only act on longer time scales, ~O(days) - except for the JobRunCount hedge. The idea behind `JobRunCount > 1` is to avoid automatic reruns of jobs, so as to avoid clashes with the VO factories: if these resent a job after a failure, we would end up with two instances of the same job.

But the problem with the missing out/err also affected CLUSTERID.0 jobs, which should be the initial iteration and thus not fall under `JobRunCount > 1`, or?

Cheers,
  Thomas

[1]
 > grep -v "#" /etc/condor/config.d/90_21_condor_cleanup.conf

RemoveHeldJobs = ( (JobStatus==5 && (CurrentTime - EnteredCurrentStatus) > 60 * 60 * 24 * 2) )

RemoveMultipleRunJobs = ( JobRunCount > 1 )

RemoveDefaultJobWallTime = ( RemoteWallClockTime > 4 * 24 * 60 * 60 )

RemoveAllJobsOlderThan2Weeks = (( CurrentTime - QDate > 60 * 60 * 24 * 14))

SYSTEM_PERIODIC_REMOVE = $(RemoveHeldJobs)           || \
                         $(RemoveMultipleRunJobs)    || \
                         $(RemoveDefaultJobWallTime) || \
                         $(RemoveAllJobsOlderThan2Weeks)
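
(As a quick cross-check of which jobs currently match the individual clauses - only a sketch, reusing the expressions from the config above against the local LRMS schedd:)

 > condor_q -allusers -constraint 'JobRunCount > 1' -af ClusterId ProcId JobRunCount
 > condor_q -allusers -constraint 'JobStatus == 5' -af ClusterId ProcId EnteredCurrentStatus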


On 04/03/2021 17.55, Brian Lin wrote:
Hi Thomas,

Jaime reminded me of another common cause of this issue: that the routed job is removed from under the CE so when the CE tries to transfer files back out to the submitter, it can't find the files it needs. Do you have any periodic removes in your local HTCondor config?

Thanks,
Brian
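
(For reference, a quick way to dump the effective expression and where it is defined - assuming the standard condor_config_val tooling and the condor_ce_* wrappers that ship with HTCondor-CE:)

 > condor_config_val -v SYSTEM_PERIODIC_REMOVE      # local HTCondor (LRMS) config
 > condor_ce_config_val -v SYSTEM_PERIODIC_REMOVE   # CE config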

On 3/4/21 10:07 AM, Thomas Hartmann wrote:
Hi Brian

unfortunately, I have not found a smoking gun yet :-/

The CE is currently on [1].
SELinux is disabled by default, and on a quick check of the permissions I did not notice anything suspicious [2]. The files are owned by the correctly mapped user - including the CLUSTERID.log. The possible number of open file handles should also be sufficient. On the fs side it is ext4 - nothing fancy. And I do not see much I/O wait or the like, which might otherwise point to an underlying issue with the HV.

I noticed several stack dumps on the CE, but AFAIS there is no overlap between the affected PIDs/IDs and these jobs.

Cheers,
 Thomas


[1]
condor-8.9.11-1.el7.x86_64
condor-boinc-7.16.11-1.el7.x86_64
condor-classads-8.9.11-1.el7.x86_64
condor-externals-8.9.11-1.el7.x86_64
condor-procd-8.9.11-1.el7.x86_64
htcondor-ce-4.4.1-3.el7.noarch
htcondor-ce-apel-4.4.1-3.el7.noarch
htcondor-ce-bdii-4.4.1-3.el7.noarch
htcondor-ce-client-4.4.1-3.el7.noarch
htcondor-ce-condor-4.4.1-3.el7.noarch
htcondor-ce-view-4.4.1-3.el7.noarch
python2-condor-8.9.11-1.el7.x86_64
python3-condor-8.9.11-1.el7.x86_64

CentOS Linux release 7.9.2009 (Core) @ 3.10.0-1160.11.1.el7.x86_64


[2]
root@grid-htcondorce0: [~] ls -all /var/lib/condor-ce/spool/6446/0/cluster406446.proc0.subproc0
total 80
drwx------ 2 belleprd000 belleprd 4096 Mar 3 06:51 .
drwxr-xr-x 4 condor condor 4096 Mar 3 06:51 ..
-rw-r--r-- 1 belleprd000 belleprd 1028 Mar 3 10:36 406446.0.log
-rwxr-xr-x 1 belleprd000 belleprd 55919 Mar 3 06:51 DIRAC_nd5lYU_pilotwrapper.py
-rw------- 1 belleprd000 belleprd 10354 Mar 3 06:51 tmpBU9zHQ

> sestatus
SELinux status:                 disabled

> cat /proc/sys/fs/file-max
1552725
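
(fs.file-max is only the system-wide ceiling; as a small additional sketch, one can also check the per-process limit of the running schedds - on a CE host there are typically two, the CE's and the local one:)

 > for pid in $(pgrep -x condor_schedd); do echo "pid $pid"; grep 'Max open files' /proc/$pid/limits; done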





