[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[HTCondor-users] Jobs failing to transfer files "condor_write(): Socket closed when trying to write"



Dear HTCondor experts,

After an HTCondor upgrade (8.6 -> 8.8), some jobs have issues transferring files.
Jobs submitted directly to the scheduler (e.g. user jobs) seem to always succeed (or I am unable to replicate the exact behaviour),
but grid jobs don't except for the first file(submission via ARC CE).

Before I ask on the ARC CE mailing list, I was wondering if there as another HTCondor adjustment I could make.
A full StartLog example is given in [1] and the transfer history in [2]. 
Note that the first file succeeds (the proxy), but consequent files fail.

We tested the combinations of
 - ARC CE + HTcondor 8.6, WNs 8.6 --> works
 - ARC CE + HTcondor 8.8, WNs 8.6 --> works
 - ARC CE + HTcondor 8.6, WNs 8.8 --> fails
 - ARC CE + HTcondor 8.8, WNs 8.8 --> fails

So the issue seems to be on the worker node side with HTCondor 8.8 (also tested with latest 8.8.8).

Any help or debugging advice is welcome.

Cheers,
Luke

[1]
/var/log/condor/StartLog
137.222.79.6 --> ARC CE with HTCondor 8.8 schedd [IPv4 + IPv6]
10.129.5.223 --> HTCondor WN (8.8.8) [IPv4 only]
04/15/20 16:15:50 slot1_1: New machine resource of type -1 allocated
04/15/20 16:15:50 Setting up slot pairings
04/15/20 16:15:50 slot1_1: Request accepted.
04/15/20 16:15:50 slot1_1: Remote owner is cmspil041@xxxxxxxxxxxxxx
04/15/20 16:15:50 slot1_1: State change: claiming protocol successful
04/15/20 16:15:50 slot1_1: Changing state: Owner -> Claimed
04/15/20 16:15:50 slot1_1: Got activate_claim request from shadow (137.222.79.6)
04/15/20 16:15:50 slot1_1: Remote job ID is 3712167.0
04/15/20 16:15:50 slot1_1: Got universe "VANILLA" (5) from request classad
04/15/20 16:15:50 slot1_1: State change: claim-activation protocol successful
04/15/20 16:15:50 slot1_1: Changing activity: Idle -> Busy
04/15/20 16:15:50 slot1_1: Called deactivate_claim()
04/15/20 16:15:50 Starter pid 119763 exited with status 4
04/15/20 16:15:50 slot1_1: State change: starter exited
04/15/20 16:15:50 slot1_1: Changing activity: Busy -> Idle
04/15/20 16:15:51 slot1_1: Got activate_claim request from shadow (137.222.79.6)
04/15/20 16:15:51 slot1_1: Remote job ID is 3712168.0
04/15/20 16:15:51 slot1_1: Got universe "VANILLA" (5) from request classad
04/15/20 16:15:51 slot1_1: State change: claim-activation protocol successful
04/15/20 16:15:51 slot1_1: Changing activity: Idle -> Busy
04/15/20 16:15:51 slot1_1: Called deactivate_claim()
04/15/20 16:15:51 slot1_1: Changing state and activity: Claimed/Busy -> Preempting/Vacating
04/15/20 16:15:51 Starter pid 119772 exited with status 4
04/15/20 16:15:51 slot1_1: State change: starter exited
04/15/20 16:15:51 slot1_1: State change: No preempting claim, returning to owner
04/15/20 16:15:51 slot1_1: Changing state and activity: Preempting/Vacating -> Owner/Idle
04/15/20 16:15:51 slot1_1: State change: IS_OWNER is false
04/15/20 16:15:51 slot1_1: Changing state: Owner -> Unclaimed
04/15/20 16:15:51 slot1_1: Changing state: Unclaimed -> Delete
04/15/20 16:15:51 slot1_1: Resource no longer needed, deleting
04/15/20 16:15:52 Error: can't find resource with ClaimId (<10.129.5.223:9618?addrs=10.129.5.223-9618&noUDP&sock=10159_ceb7_3>#1586962562#31#...)
for 444 (ACTIVATE_CLAIM)
04/15/20 16:15:52 Error: can't find resource with ClaimId (<10.129.5.223:9618?addrs=10.129.5.223-9618&noUDP&sock=10159_ceb7_3>#1586962562#31#...)
-- perhaps this claim was already removed?
04/15/20 16:15:52 Error: problem finding resource for 403 (DEACTIVATE_CLAIM)
04/15/20 16:15:52 Error: can't find resource with ClaimId (<10.129.5.223:9618?addrs=10.129.5.223-9618&noUDP&sock=10159_ceb7_3>#1586962562#31#...)
for 443 (RELEASE_CLAIM); perhaps this claim was removed already.
04/15/20 16:15:52 condor_write(): Socket closed when trying to write 29 bytes to <137.222.79.6:25987>, fd is 15
04/15/20 16:15:52 Buf::write(): condor_write() failed

[2]
/var/log/condor/transfer_history
***
JobOwner = "cmspil041"
JobProcId = 0
ConnectionTimeSeconds = 0.0008680820465087891
TransferEndTime = 1586963751.451765
TransferFileBytes = 17019
TransferStartTime = 1586963751.450897
TransferSuccess = true
TransferTotalBytes = 17019
TransferFileName = "job.PkEMDmC3xiwnvnizHq7yZUKmABFKDmABFKDm6ReTDmEBFKDmxHuBCo.proxy"
TransferType = "download"
TransferProtocol = "cedar"
JobClusterId = 3712168
***
JobOwner = "cmspil041"
JobProcId = 0
ConnectionTimeSeconds = 0.0001161098480224609
TransferEndTime = 1586963751.452136
TransferFileBytes = 0
TransferStartTime = 1586963751.45202
TransferSuccess = false
TransferTotalBytes = 0
TransferFileName = ".gahp_complete"
TransferType = "download"
TransferProtocol = "cedar"
JobClusterId = 3712168
***
JobOwner = "cmspil041"
JobProcId = 0
ConnectionTimeSeconds = 0.006474018096923828
TransferEndTime = 1586963751.458725
TransferFileBytes = 69546
TransferStartTime = 1586963751.452251
TransferSuccess = false
TransferTotalBytes = 69546
TransferFileName = "glidein_startup.sh"
TransferType = "download"
TransferProtocol = "cedar"
JobClusterId = 3712168
***
JobOwner = "cmspil041"
JobProcId = 0
ConnectionTimeSeconds = 0.0001227855682373047
TransferEndTime = 1586963751.458953
TransferFileBytes = 13373
TransferStartTime = 1586963751.45883
TransferSuccess = false
TransferTotalBytes = 13373
TransferFileName = "condor_exec.exe"
TransferType = "download"
TransferProtocol = "cedar"
JobClusterId = 3712168
***