
Re: [HTCondor-users] CondorCE: job submission to Condor-LRMS fails due to stdout/stderr files missing during staging(?)



and hi again,

maybe related to the issue of failing spooling/staging:
I monitored what happens when the condor service unit is restarted. The thing is that on our CEs the schedd did not properly connect to the shadows. Since the restart took hardly any time, the schedd tried to reconnect to the shadows; however, it failed, gave up, and dropped all shadows.

What makes me suspicious are messages in the ShadowLog [1] that point to a problem delegating the grid proxies.

However, the directory and the proxy file exist and are readable [2], so I am a bit lost. AFAICS there is no hurdle to reading from or writing to the FS, and the proxy itself is parseable with the local openssl version [3]. But the behaviour looks similar to my other case, where files are not spooled from the CE to the LRMS schedd?

Cheers and thanks for any ideas,
  Thomas


[1]
03/08/21 14:01:09 (9334.0) (2001211): relisock_gsi_get (read from socket) failure
03/08/21 14:01:09 (9334.0) (2001211): ReliSock::put_x509_delegation(): delegation failed: Failed to receive delegation request
03/08/21 14:01:09 (9334.0) (2001211): DoUpload: SHADOW at 131.169.223.119 failed to send file(s) to <131.169.160.33:41404>: error sending /var/lib/condor-ce/spool/5212/0/cluster5212.proc0.subproc0/tmpL6wa9T
03/08/21 14:01:09 (9334.0) (2001211): File transfer failed (status=0).


03/08/21 14:01:05 (9247.0) (2001032): condor_write() failed: send() 13 bytes to <131.169.162.103:34015> returned -1, timeout=0, errno=104 Connection reset by peer.
03/08/21 14:01:05 (9247.0) (2001032): Buf::write(): condor_write() failed
03/08/21 14:01:05 (9247.0) (2001032): ReliSock::put_x509_delegation(): delegation failed: globus_gsi_proxy: Error with X.509 request structure: Couldn't convert X509_REQ struct from DER encoded to internal form OpenSSL Error: a_d2i_fp.c:247: in library: asn1 encoding routines, function ASN1_D2I_READ_BIO: not enough data

03/08/21 14:01:05 (9247.0) (2001032): DoUpload: SHADOW at 131.169.223.119 failed to send file(s) to <131.169.162.103:34015>: error sending /var/lib/condor-ce/spool/5380/5/cluster5380.proc5.subproc0/tmpL6wa9T
03/08/21 14:01:05 (9247.0) (2001032): File transfer failed (status=0).
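
To see whether the delegation failures cluster on particular jobs, the entries above can be grouped per job ID. This is only a minimal sketch in Python, assuming the line layout seen in the excerpts above (date, time, job ID in parentheses, PID in parentheses, then the message). The function name `delegation_failures` and the regular expression are my own illustration, not an HTCondor tool.

```python
import re
from collections import defaultdict

# Assumed layout, taken from the excerpts above:
#   03/08/21 14:01:05 (9247.0) (2001032): message
LINE_RE = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"\((?P<job>\d+\.\d+)\) \((?P<pid>\d+)\): (?P<msg>.*)$"
)


def delegation_failures(lines):
    """Group 'delegation failed' messages by job ID (cluster.proc).

    Hypothetical helper: returns {job_id: [(timestamp, message), ...]}.
    Lines that do not match the assumed layout are skipped.
    """
    failures = defaultdict(list)
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and "delegation failed" in m.group("msg"):
            failures[m.group("job")].append((m.group("ts"), m.group("msg")))
    return dict(failures)
```

It can be fed an open file handle on the ShadowLog, or any list of lines, and the resulting dictionary can then be compared against the job IDs that were dropped after the restart.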

[2]
> ls -all /var/lib/condor-ce/spool/5380/5/cluster5380.proc5.subproc0
total 80
drwx------ 2 belleprd000 belleprd  4096 Mar  8 08:14 .
drwxr-xr-x 4 condor      condor    4096 Mar  8 08:14 ..
-rw-r--r-- 1 belleprd000 belleprd   528 Mar  8 13:57 5380.5.log
-rwxr-xr-x 1 belleprd000 belleprd 55919 Mar  8 08:14 DIRAC_d1xT5x_pilotwrapper.py
-rw------- 1 belleprd000 belleprd 10362 Mar  8 08:14 tmpL6wa9T

> openssl x509 -in /var/lib/condor-ce/spool/5380/5/cluster5380.proc5.subproc0/tmpL6wa9T -noout -text

Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 898008020 (0x358683d4)
    Signature Algorithm: sha256WithRSAEncryption
...
         b5:6c:b2:b6:c2:12:b6:82:2c:bc:1a:06:8f:b3:dc:b7:7f:16:
         34:46
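
One caveat on the check above: `openssl x509` only parses the first certificate in the file, while a grid proxy file normally carries more than that (the proxy certificate, its private key, and the issuing chain). It may be worth confirming that all PEM blocks are present. A minimal sketch in Python, assuming the file is plain PEM text; `pem_block_types` is a hypothetical helper name:

```python
import re

# Matches PEM header lines such as "-----BEGIN CERTIFICATE-----".
PEM_BEGIN = re.compile(r"^-----BEGIN ([A-Z0-9 ]+)-----\s*$", re.MULTILINE)


def pem_block_types(text):
    """Return the labels of all PEM blocks in `text`, in order of appearance."""
    return PEM_BEGIN.findall(text)
```

Calling it on the contents of the proxy file lists the block labels in order, which shows at a glance whether the key or part of the chain is missing.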

[3]
globus-gsi-openssl-error-4.2-1.el7.x86_64
globus-openssl-module-5.2-1.el7.x86_64
openssl-1.0.2k-21.el7_9.x86_64
openssl-libs-1.0.2k-21.el7_9.x86_64

condor-8.9.11-1.el7.x86_64
condor-boinc-7.16.11-1.el7.x86_64
condor-classads-8.9.11-1.el7.x86_64
condor-externals-8.9.11-1.el7.x86_64
condor-procd-8.9.11-1.el7.x86_64
htcondor-ce-4.4.1-3.el7.noarch
htcondor-ce-apel-4.4.1-3.el7.noarch
htcondor-ce-bdii-4.4.1-3.el7.noarch
htcondor-ce-client-4.4.1-3.el7.noarch
htcondor-ce-condor-4.4.1-3.el7.noarch
htcondor-ce-view-4.4.1-3.el7.noarch
python2-condor-8.9.11-1.el7.x86_64
python3-condor-8.9.11-1.el7.x86_64
