
Re: [HTCondor-users] CondorCE: job submission to Condor-LRMS fails due to stdout/stderr files missing during staging(?)



Hi Thomas,

Jaime reminded me of another common cause of this issue: the routed job is removed out from under the CE, so when the CE tries to transfer files back to the submitter, it can't find the files it needs. Do you have any periodic removes in your local HTCondor config?
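In case it helps, here is a minimal sketch of one way to look for such settings. It assumes the removal would come from a config knob whose name contains PERIODIC_REMOVE (for example SYSTEM_PERIODIC_REMOVE) and that `condor_config_val -dump` prints `NAME = value` lines. The helper name `find_periodic_removes` is made up for illustration.

```shell
# Hypothetical helper: keep only config lines whose knob NAME (the part
# before the first "=") contains PERIODIC_REMOVE, case-insensitively.
# Lines that only mention PERIODIC_REMOVE in the value are ignored.
find_periodic_removes() {
    grep -Ei '^[^=]*PERIODIC_REMOVE[^=]*='
}

# Intended use on the host running the local HTCondor schedd
# (left as a comment, since it needs an HTCondor installation):
#   condor_config_val -dump | find_periodic_removes
```

An empty result only means no such knob is set in the config. A periodic remove expression carried by the job itself would not show up here.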

Thanks,
Brian

On 3/4/21 10:07 AM, Thomas Hartmann wrote:
Hi Brian

unfortunately, I have not found a smoking gun yet :-/

The CE is currently on [1].
SELinux is disabled by default, and on a quick check of the permissions I did not notice anything suspicious [2]. The files have the correctly mapped user, including the CLUSTERID.log. The limit on open file handles should also be sufficient. On the filesystem side it is ext4, so nothing fancy. And I do not see much I/O wait or the like, which might otherwise point to an underlying issue with the HV.

I noticed several stack dumps on the CE, but AFAIS there has been no overlap between the affected PIDs/IDs and these jobs.

Cheers,
 Thomas


[1]
condor-8.9.11-1.el7.x86_64
condor-boinc-7.16.11-1.el7.x86_64
condor-classads-8.9.11-1.el7.x86_64
condor-externals-8.9.11-1.el7.x86_64
condor-procd-8.9.11-1.el7.x86_64
htcondor-ce-4.4.1-3.el7.noarch
htcondor-ce-apel-4.4.1-3.el7.noarch
htcondor-ce-bdii-4.4.1-3.el7.noarch
htcondor-ce-client-4.4.1-3.el7.noarch
htcondor-ce-condor-4.4.1-3.el7.noarch
htcondor-ce-view-4.4.1-3.el7.noarch
python2-condor-8.9.11-1.el7.x86_64
python3-condor-8.9.11-1.el7.x86_64

CentOS Linux release 7.9.2009 (Core) @ 3.10.0-1160.11.1.el7.x86_64


[2]
root@grid-htcondorce0: [~] ls -all /var/lib/condor-ce/spool/6446/0/cluster406446.proc0.subproc0
total 80
drwx------ 2 belleprd000 belleprd 4096 Mar 3 06:51 .
drwxr-xr-x 4 condor condor 4096 Mar 3 06:51 ..
-rw-r--r-- 1 belleprd000 belleprd 1028 Mar 3 10:36 406446.0.log
-rwxr-xr-x 1 belleprd000 belleprd 55919 Mar 3 06:51 DIRAC_nd5lYU_pilotwrapper.py
-rw------- 1 belleprd000 belleprd 10354 Mar 3 06:51 tmpBU9zHQ

> sestatus
SELinux status:                 disabled

> cat /proc/sys/fs/file-max
1552725