
Re: [HTCondor-users] CondorCE: job submission to Condor-LRMS fails due to stdout/stderr files missing during staging(?)



Hi again,

maybe related(??): I just noticed that a restart of the condor unit caused the Schedd to lose all its jobs [1]. Since the restart [2] was more or less instantaneous, I would have expected the Schedd to pick up its jobs again.

Cheers,
  Thomas

[1]
03/05/21 10:55:11 (pid:3997828) WARNING - Cluster 437906 was deleted with proc ads still attached to it. This should only happen during schedd shutdown.


[2]
Mar 05 10:55:11 grid-htcondorce0.desy.de systemd[1]: Stopping Condor Distributed High-Throughput-Computing...
-- Subject: Unit condor.service has begun shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit condor.service has begun shutting down.
Mar 05 10:55:11 grid-htcondorce0.desy.de systemd[1]: Stopped Condor Distributed High-Throughput-Computing.
-- Subject: Unit condor.service has finished shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit condor.service has finished shutting down.
Mar 05 10:55:11 grid-htcondorce0.desy.de systemd[1]: Starting Condor Distributed High-Throughput-Computing...
-- Subject: Unit condor.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit condor.service has begun starting up.
Mar 05 10:55:11 grid-htcondorce0.desy.de systemd[1]: Started Condor Distributed High-Throughput-Computing.
-- Subject: Unit condor.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
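
(As a rough cross-check - only a sketch, assuming the stock config knobs and default paths - one can verify where the local schedd persists its queue and whether the queue survives a restart:)

 > # location of the schedd's persistent job queue (defaults to $(SPOOL)/job_queue.log)
 > condor_config_val -v JOB_QUEUE_LOG SPOOL
 > # snapshot the local schedd's queue before/after the restart and compare
 > condor_q -allusers -af ClusterId ProcId JobStatus | tee /tmp/queue.before
 > systemctl restart condor
 > condor_q -allusers -af ClusterId ProcId JobStatus | tee /tmp/queue.after
 > diff /tmp/queue.before /tmp/queue.after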



On 05/03/2021 10.26, Thomas Hartmann wrote:
Hi Brian,

yes, we have periodic removes [1]. But 'in principle' these should mostly only act on longer time scales, ~O(days) - except for the JobRunCount hedge. The idea behind `JobRunCount > 1` is to avoid automatic reruns of jobs, so as to avoid clashes with the VO factories: if these resent a job after a failure, we would end up with two instances of the same job.

But the problem with the missing out/err also affected CLUSTERID.0 jobs, which should be the initial iteration and thus not fall under `JobRunCount > 1`, or?

Cheers,
  Thomas

[1]
 > grep -v "#" /etc/condor/config.d/90_21_condor_cleanup.conf

RemoveHeldJobs = ( (JobStatus==5 && (CurrentTime - EnteredCurrentStatus) > 60 * 60 * 24 * 2) )

RemoveMultipleRunJobs = ( JobRunCount > 1 )

RemoveDefaultJobWallTime = ( RemoteWallClockTime > 4 * 24 * 60 * 60 )

RemoveAllJobsOlderThan2Weeks = (( CurrentTime - QDate > 60 * 60 * 24 * 14))

SYSTEM_PERIODIC_REMOVE = $(RemoveHeldJobs)           || \
                         $(RemoveMultipleRunJobs)    || \
                         $(RemoveDefaultJobWallTime) || \
                         $(RemoveAllJobsOlderThan2Weeks)
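
(As a quick cross-check of which jobs currently match the individual clauses - only a sketch, reusing the expressions from the config above against the local LRMS schedd:)

 > condor_q -allusers -constraint 'JobRunCount > 1' -af ClusterId ProcId JobRunCount
 > condor_q -allusers -constraint 'JobStatus == 5' -af ClusterId ProcId EnteredCurrentStatus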


On 04/03/2021 17.55, Brian Lin wrote:
Hi Thomas,

Jaime reminded me of another common cause of this issue: that the routed job is removed from under the CE so when the CE tries to transfer files back out to the submitter, it can't find the files it needs. Do you have any periodic removes in your local HTCondor config?

Thanks,
Brian
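
(For reference, a quick way to dump the effective expression and where it is defined - assuming the standard condor_config_val tooling and the condor_ce_* wrappers that ship with HTCondor-CE:)

 > condor_config_val -v SYSTEM_PERIODIC_REMOVE      # local HTCondor (LRMS) config
 > condor_ce_config_val -v SYSTEM_PERIODIC_REMOVE   # CE config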

On 3/4/21 10:07 AM, Thomas Hartmann wrote:
Hi Brian

unfortunately, I have not found a smoking gun yet :-/

The CE is currently on [1].
SELinux is disabled by default, and on a quick check of the permissions I did not notice anything suspicious [2]. The files are owned by the correctly mapped user - including the CLUSTERID.log. The possible number of open file handles should also be sufficient. On the fs side it is ext4 - nothing fancy. And I do not see much I/O wait or the like, which might otherwise point to an underlying issue with the HV.

I noticed several stack dumps on the CE, but AFAIS there is no overlap between the affected PIDs/IDs and these jobs.

Cheers,
 Thomas


[1]
condor-8.9.11-1.el7.x86_64
condor-boinc-7.16.11-1.el7.x86_64
condor-classads-8.9.11-1.el7.x86_64
condor-externals-8.9.11-1.el7.x86_64
condor-procd-8.9.11-1.el7.x86_64
htcondor-ce-4.4.1-3.el7.noarch
htcondor-ce-apel-4.4.1-3.el7.noarch
htcondor-ce-bdii-4.4.1-3.el7.noarch
htcondor-ce-client-4.4.1-3.el7.noarch
htcondor-ce-condor-4.4.1-3.el7.noarch
htcondor-ce-view-4.4.1-3.el7.noarch
python2-condor-8.9.11-1.el7.x86_64
python3-condor-8.9.11-1.el7.x86_64

CentOS Linux release 7.9.2009 (Core) @ 3.10.0-1160.11.1.el7.x86_64


[2]
root@grid-htcondorce0: [~] ls -all /var/lib/condor-ce/spool/6446/0/cluster406446.proc0.subproc0
total 80
drwx------ 2 belleprd000 belleprd 4096 Mar 3 06:51 .
drwxr-xr-x 4 condor condor 4096 Mar 3 06:51 ..
-rw-r--r-- 1 belleprd000 belleprd 1028 Mar 3 10:36 406446.0.log
-rwxr-xr-x 1 belleprd000 belleprd 55919 Mar 3 06:51 DIRAC_nd5lYU_pilotwrapper.py
-rw------- 1 belleprd000 belleprd 10354 Mar 3 06:51 tmpBU9zHQ

> sestatus
SELinux status:                 disabled

> cat /proc/sys/fs/file-max
1552725
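
(fs.file-max is only the system-wide ceiling; as a small additional sketch, one can also check the per-process limit of the running schedds - on a CE host there are typically two, the CE's and the local one:)

 > for pid in $(pgrep -x condor_schedd); do echo "pid $pid"; grep 'Max open files' /proc/$pid/limits; done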





