
[HTCondor-users] CondorCE: job submission to Condor-LRMS fails due to stdout/stderr files missing during staging(?)



Hi all,

I am trying to understand why several jobs that entered our CondorCE went on hold when their submission to Condor failed, apparently because the Out and Err files were missing.

For example, CondorCE job 406446.0 was, in principle, routed to 422981.0 [1]. The CE job's spool directory exists [2].

However, the resulting Condor job fails during submission when, as far as I can see, the stderr and stdout files are opened for reading [3].

I am not sure whether the Out and Err files make much sense at this stage of a job (and whether one could, as a quick fix, replace them in a route with /dev/null or similar)?
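
To check other held jobs for the same pattern, something like the following could compare the files named in a job ad against the spool directory. This is only a minimal sketch: the function name `missing_job_files` and the plain-dictionary form of the job ad are assumptions for illustration, not part of HTCondor.

```python
import os


def missing_job_files(job_ad):
    """Return the Out/Err/UserLog file names that do not exist on disk.

    `job_ad` is a hypothetical plain dict holding the attributes of
    interest (Iwd, Out, Err, UserLog), e.g. copied from the job ad.
    Relative names are resolved against Iwd.
    """
    iwd = job_ad["Iwd"]
    missing = []
    for attr in ("Out", "Err", "UserLog"):
        name = job_ad.get(attr)
        if not name:
            # Attribute not set in the ad, nothing to check.
            continue
        path = name if os.path.isabs(name) else os.path.join(iwd, name)
        if not os.path.exists(path):
            missing.append(name)
    return missing
```

With the attributes from [1] and the directory listing from [2], such a check would be expected to flag the Out and Err files but not the UserLog.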

Cheers,
  Thomas

[1]
ClusterId = 406446
UserLog = "406446.0.log"
GlobalJobId = "grid-htcondorce0.desy.de#406446.0#1614750664"
Environment = "HTCONDOR_JOBID=406446.0"
Iwd = "/var/lib/condor-ce/spool/6446/0/cluster406446.proc0.subproc0"
RoutedToJobId = "422981.0"

Out = "406446.0.out"
Err = "406446.0.err"

[2]
> ls /var/lib/condor-ce/spool/6446/0/cluster406446.proc0.subproc0
406446.0.log  DIRAC_nd5lYU_pilotwrapper.py  tmpBU9zHQ


[3]
03/04/21 13:25:04 (cid:107) Transferring files for jobs 406446.0
03/04/21 13:25:04 (cid:107) spoolJobFiles(): started worker process
03/04/21 13:25:04 The submitting job ad as the FileTransferObject sees it
...
xcount = 1
03/04/21 13:25:04 ReliSock::put_file_with_permissions(): Failed to stat file '/var/lib/condor-ce/spool/6446/0/cluster406446.proc0.subproc0/406446.0.err': No such file or directory (errno: 2, si_error: 1)
03/04/21 13:25:04 ReliSock::put_file_with_permissions(): Failed to stat file '/var/lib/condor-ce/spool/6446/0/cluster406446.proc0.subproc0/406446.0.out': No such file or directory (errno: 2, si_error: 1)
03/04/21 13:25:04 DoUpload: (Condor error code 13, subcode 2) SCHEDD at 131.169.223.129 failed to send file(s) to <202.13.206.84:37118>: error reading from /var/lib/condor-ce/spool/6446/0/cluster406446.proc0.subproc0/406446.0.err: (errno 2) No such file or directory; TOOL failed to receive file(s) from <131.169.223.129:9619>
03/04/21 13:25:04 (cid:107) generalJobFilesWorkerThread(): failed to transfer files for job 406446.0
03/04/21 13:25:04 condor_write(): Socket closed when trying to write 29 bytes to <202.13.206.84:37118>, fd is 21
03/04/21 13:25:04 Buf::write(): condor_write() failed
03/04/21 13:25:04 ERROR - Staging of job files failed!
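
To collect the affected paths across many such log excerpts, a short script could pull out every file named in a "Failed to stat file" message. This is a rough sketch that assumes the message format shown above; the function name `failed_stat_paths` is hypothetical.

```python
import re

# Matches the quoted path in messages such as:
#   ... Failed to stat file '/path/to/file': No such file or directory ...
STAT_FAILURE = re.compile(r"Failed to stat file '([^']+)'")


def failed_stat_paths(log_text):
    """Return the distinct paths reported as failed stats, in first-seen order."""
    seen = []
    for path in STAT_FAILURE.findall(log_text):
        if path not in seen:
            seen.append(path)
    return seen
```

Because the pattern is searched across the whole text, it also works when several log messages end up fused on one line.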

