[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [HTCondor-users] Jobs failing to transfer files "condor_write(): Socket closed when trying to write"



The starter log on the execute machine and the shadow log on the submit machine will have more details on the file transfers.

In attempting to reproduce the failure with a local submission, have you tried replicating the exact set of input files to be transferred?

 - Jaime

On Apr 15, 2020, at 11:30 AM, Luke Kreczko <L.Kreczko@xxxxxxxxxxxxx> wrote:

Dear HTCondor experts,

After an HTCondor upgrade (8.6 -> 8.8), some jobs have issues transferring files.
Jobs submitted directly to the scheduler (e.g. user jobs) seem to always succeed (or I am unable to replicate the exact behaviour),
but grid jobs don't except for the first file(submission via ARC CE).

Before I ask on the ARC CE mailing list, I was wondering if there as another HTCondor adjustment I could make.
A full StartLog example is given in [1] and the transfer history in [2]. 
Note that the first file succeeds (the proxy), but consequent files fail.

We tested the combinations of
 - ARC CE + HTcondor 8.6, WNs 8.6 --> works
 - ARC CE + HTcondor 8.8, WNs 8.6 --> works
 - ARC CE + HTcondor 8.6, WNs 8.8 --> fails
 - ARC CE + HTcondor 8.8, WNs 8.8 --> fails

So the issue seems to be on the worker node side with HTCondor 8.8 (also tested with latest 8.8.8).

Any help or debugging advice is welcome.

Cheers,
Luke

[1]
/var/log/condor/StartLog
137.222.79.6 --> ARC CE with HTCondor 8.8 schedd [IPv4 + IPv6]
10.129.5.223 --> HTCondor WN (8.8.8) [IPv4 only]
04/15/20 16:15:50 slot1_1: New machine resource of type -1 allocated
04/15/20 16:15:50 Setting up slot pairings
04/15/20 16:15:50 slot1_1: Request accepted.
04/15/20 16:15:50 slot1_1: Remote owner is cmspil041@xxxxxxxxxxxxxx
04/15/20 16:15:50 slot1_1: State change: claiming protocol successful
04/15/20 16:15:50 slot1_1: Changing state: Owner -> Claimed
04/15/20 16:15:50 slot1_1: Got activate_claim request from shadow (137.222.79.6)
04/15/20 16:15:50 slot1_1: Remote job ID is 3712167.0
04/15/20 16:15:50 slot1_1: Got universe "VANILLA" (5) from request classad
04/15/20 16:15:50 slot1_1: State change: claim-activation protocol successful
04/15/20 16:15:50 slot1_1: Changing activity: Idle -> Busy
04/15/20 16:15:50 slot1_1: Called deactivate_claim()
04/15/20 16:15:50 Starter pid 119763 exited with status 4
04/15/20 16:15:50 slot1_1: State change: starter exited
04/15/20 16:15:50 slot1_1: Changing activity: Busy -> Idle
04/15/20 16:15:51 slot1_1: Got activate_claim request from shadow (137.222.79.6)
04/15/20 16:15:51 slot1_1: Remote job ID is 3712168.0
04/15/20 16:15:51 slot1_1: Got universe "VANILLA" (5) from request classad
04/15/20 16:15:51 slot1_1: State change: claim-activation protocol successful
04/15/20 16:15:51 slot1_1: Changing activity: Idle -> Busy
04/15/20 16:15:51 slot1_1: Called deactivate_claim()
04/15/20 16:15:51 slot1_1: Changing state and activity: Claimed/Busy -> Preempting/Vacating
04/15/20 16:15:51 Starter pid 119772 exited with status 4
04/15/20 16:15:51 slot1_1: State change: starter exited
04/15/20 16:15:51 slot1_1: State change: No preempting claim, returning to owner
04/15/20 16:15:51 slot1_1: Changing state and activity: Preempting/Vacating -> Owner/Idle
04/15/20 16:15:51 slot1_1: State change: IS_OWNER is false
04/15/20 16:15:51 slot1_1: Changing state: Owner -> Unclaimed
04/15/20 16:15:51 slot1_1: Changing state: Unclaimed -> Delete
04/15/20 16:15:51 slot1_1: Resource no longer needed, deleting
04/15/20 16:15:52 Error: can't find resource with ClaimId (<10.129.5.223:9618?addrs=10.129.5.223-9618&noUDP&sock=10159_ceb7_3>#1586962562#31#...)
for 444 (ACTIVATE_CLAIM)
04/15/20 16:15:52 Error: can't find resource with ClaimId (<10.129.5.223:9618?addrs=10.129.5.223-9618&noUDP&sock=10159_ceb7_3>#1586962562#31#...)
-- perhaps this claim was already removed?
04/15/20 16:15:52 Error: problem finding resource for 403 (DEACTIVATE_CLAIM)
04/15/20 16:15:52 Error: can't find resource with ClaimId (<10.129.5.223:9618?addrs=10.129.5.223-9618&noUDP&sock=10159_ceb7_3>#1586962562#31#...)
for 443 (RELEASE_CLAIM); perhaps this claim was removed already.
04/15/20 16:15:52 condor_write(): Socket closed when trying to write 29 bytes to <137.222.79.6:25987>, fd is 15
04/15/20 16:15:52 Buf::write(): condor_write() failed

[2]
/var/log/condor/transfer_history
***
JobOwner = "cmspil041"
JobProcId = 0
ConnectionTimeSeconds = 0.0008680820465087891
TransferEndTime = 1586963751.451765
TransferFileBytes = 17019
TransferStartTime = 1586963751.450897
TransferSuccess = true
TransferTotalBytes = 17019
TransferFileName = "job.PkEMDmC3xiwnvnizHq7yZUKmABFKDmABFKDm6ReTDmEBFKDmxHuBCo.proxy"
TransferType = "download"
TransferProtocol = "cedar"
JobClusterId = 3712168
***
JobOwner = "cmspil041"
JobProcId = 0
ConnectionTimeSeconds = 0.0001161098480224609
TransferEndTime = 1586963751.452136
TransferFileBytes = 0
TransferStartTime = 1586963751.45202
TransferSuccess = false
TransferTotalBytes = 0
TransferFileName = ".gahp_complete"
TransferType = "download"
TransferProtocol = "cedar"
JobClusterId = 3712168
***
JobOwner = "cmspil041"
JobProcId = 0
ConnectionTimeSeconds = 0.006474018096923828
TransferEndTime = 1586963751.458725
TransferFileBytes = 69546
TransferStartTime = 1586963751.452251
TransferSuccess = false
TransferTotalBytes = 69546
TransferFileName = "glidein_startup.sh"
TransferType = "download"
TransferProtocol = "cedar"
JobClusterId = 3712168
***
JobOwner = "cmspil041"
JobProcId = 0
ConnectionTimeSeconds = 0.0001227855682373047
TransferEndTime = 1586963751.458953
TransferFileBytes = 13373
TransferStartTime = 1586963751.45883
TransferSuccess = false
TransferTotalBytes = 13373
TransferFileName = "condor_exec.exe"
TransferType = "download"
TransferProtocol = "cedar"
JobClusterId = 3712168
***

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/