
Re: [HTCondor-users] CondorCE: job submission to Condor-LRMS fails due to stdout/stderr files missing during staging(?)



Hi Brian

Unfortunately, I have not found a smoking gun yet :-/

The CE is currently on [1].
SELinux is disabled by default, and on a quick check of the permissions I did not notice anything suspicious [2]. The files have the correctly mapped user, including the CLUSTERID.log. The limit on open file handles should also be sufficient. On the fs side it is ext4, so nothing fancy. And I do not see much I/O wait or the like, which might otherwise point to an underlying issue with the HV.

I noticed several stack dumps on the CE, but AFAIS there has been no overlap between the affected PIDs/IDs and these jobs.

Cheers,
  Thomas


[1]
condor-8.9.11-1.el7.x86_64
condor-boinc-7.16.11-1.el7.x86_64
condor-classads-8.9.11-1.el7.x86_64
condor-externals-8.9.11-1.el7.x86_64
condor-procd-8.9.11-1.el7.x86_64
htcondor-ce-4.4.1-3.el7.noarch
htcondor-ce-apel-4.4.1-3.el7.noarch
htcondor-ce-bdii-4.4.1-3.el7.noarch
htcondor-ce-client-4.4.1-3.el7.noarch
htcondor-ce-condor-4.4.1-3.el7.noarch
htcondor-ce-view-4.4.1-3.el7.noarch
python2-condor-8.9.11-1.el7.x86_64
python3-condor-8.9.11-1.el7.x86_64

CentOS Linux release 7.9.2009 (Core) @ 3.10.0-1160.11.1.el7.x86_64


[2]
root@grid-htcondorce0: [~] ls -all /var/lib/condor-ce/spool/6446/0/cluster406446.proc0.subproc0
total 80
drwx------ 2 belleprd000 belleprd  4096 Mar  3 06:51 .
drwxr-xr-x 4 condor      condor    4096 Mar  3 06:51 ..
-rw-r--r-- 1 belleprd000 belleprd  1028 Mar  3 10:36 406446.0.log
-rwxr-xr-x 1 belleprd000 belleprd 55919 Mar  3 06:51 DIRAC_nd5lYU_pilotwrapper.py
-rw------- 1 belleprd000 belleprd 10354 Mar  3 06:51 tmpBU9zHQ

> sestatus
SELinux status:                 disabled

> cat /proc/sys/fs/file-max
1552725
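
To repeat the ownership check from [2] across more than one spool directory, something like the following could be used. It is only a minimal sketch: the function name find_unexpected_owners is made up for illustration and is not part of HTCondor or HTCondor-CE, and it only compares the owning UID, not group or mode bits.

```python
import os


def find_unexpected_owners(root, expected_uid):
    """Return every path under root (root included) not owned by expected_uid.

    Hypothetical helper for illustration; it only looks at the owning UID.
    """
    mismatches = []
    # Check the top-level directory itself, since os.walk only yields children.
    if os.lstat(root).st_uid != expected_uid:
        mismatches.append(root)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat so that symlinks are checked themselves, not their targets.
            if os.lstat(path).st_uid != expected_uid:
                mismatches.append(path)
    return mismatches
```

For the directory in [2] this would be called with the spool path and pwd.getpwnam("belleprd000").pw_uid, run as root since the job directory is mode 0700. An empty list would match what the listing above shows.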
