Re: [HTCondor-users] Jobs failing to transfer files "condor_write(): Socket closed when trying to write"

Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

04/15/20 16:15:50 (pid:119763) Submitting machine is "lcgce01.phy.bris.ac.uk"
04/15/20 16:15:50 (pid:119763) setting the orig job name in starter
04/15/20 16:15:50 (pid:119763) setting the orig job iwd in starter
04/15/20 16:15:50 (pid:119763) Chirp config summary: IO false, Updates false, Delayed updates true.
04/15/20 16:15:50 (pid:119763) Initialized IO Proxy.
04/15/20 16:15:50 (pid:119763) Done setting resource limits
04/15/20 16:15:50 (pid:119763) Got SIGTERM. Performing graceful shutdown.
04/15/20 16:15:50 (pid:119763) ShutdownGraceful all jobs.
04/15/20 16:15:50 (pid:119763) Failed to open '.update.ad' to read update ad: No such file or directory (2).
04/15/20 16:15:50 (pid:119763) All jobs have exited... starter exiting
04/15/20 16:15:50 (pid:119763) After chmod(), still can't remove "/condor/condor/dir_119763" as directory owner, giving up!
04/15/20 16:15:50 (pid:119763) **** condor_starter (condor_STARTER) pid 119763 EXITING WITH STATUS 0
04/15/20 16:15:50 (pid:119763) ERROR "Assertion ERROR on (daemonCore)" at line 4032 in file /var/lib/condor/execute/slot2/dir_13658/userdir/.tmpgZgx4N/BUILD/condor-8.8.8/src/condor_utils/file_transfer.cpp
04/15/20 16:15:51 (pid:119772) ******************************************************
04/15/20 16:15:51 (pid:119772) ** condor_starter (CONDOR_STARTER) STARTING UP
04/15/20 16:15:51 (pid:119772) ** /usr/sbin/condor_starter
04/15/20 16:15:51 (pid:119772) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
04/15/20 16:15:51 (pid:119772) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
04/15/20 16:15:51 (pid:119772) ** $CondorVersion: 8.8.8 Mar 19 2020 BuildID: 498525 PackageID: 8.8.8-1 $
04/15/20 16:15:51 (pid:119772) ** $CondorPlatform: x86_64_CentOS7 $
04/15/20 16:15:51 (pid:119772) ** PID = 119772
04/15/20 16:15:51 (pid:119772) ** Log last touched 4/15 16:15:50
04/15/20 16:15:51 (pid:119772) ******************************************************
04/15/20 16:15:51 (pid:119772) Using config source: /etc/condor/condor_config
04/15/20 16:15:51 (pid:119772) Using local config sources:
04/15/20 16:15:51 (pid:119772)    /etc/condor/config.d/00_config_local.config
04/15/20 16:15:51 (pid:119772)    /etc/condor/config.d/10_security.config
04/15/20 16:15:51 (pid:119772)    /etc/condor/config.d/20_workernode.config
04/15/20 16:15:51 (pid:119772)    /etc/condor/config.d/50_singularity.config
04/15/20 16:15:51 (pid:119772)    /etc/condor/config.d/60_docker.config
04/15/20 16:15:51 (pid:119772)    /etc/condor/config.d/888_ipv6_off.config
04/15/20 16:15:51 (pid:119772)    /etc/condor/condor_config.local
04/15/20 16:15:51 (pid:119772) config Macros = 162, Sorted = 161, StringBytes = 6393, TablesBytes = 5920
04/15/20 16:15:51 (pid:119772) CLASSAD_CACHING is OFF
04/15/20 16:15:51 (pid:119772) Daemon Log is logging: D_ALWAYS D_ERROR
04/15/20 16:15:51 (pid:119772) SharedPortEndpoint: waiting for connections to named socket 114634_c5ef_13
04/15/20 16:15:51 (pid:119772) DaemonCore: command socket at <10.129.5.223:9618?addrs=10.129.5.223-9618&noUDP&sock=114634_c5ef_13>
04/15/20 16:15:51 (pid:119772) DaemonCore: private command socket at <10.129.5.223:9618?addrs=10.129.5.223-9618&noUDP&sock=114634_c5ef_13>
04/15/20 16:15:51 (pid:119772) Communicating with shadow <137.222.79.6:9618?addrs=137.222.79.6-9618+[2001-630-e4-2810-137-222-79-6]-9618&noUDP&sock=2337399_cbe3_600>
04/15/20 16:15:51 (pid:119772) Submitting machine is "lcgce01.phy.bris.ac.uk"
04/15/20 16:15:51 (pid:119772) setting the orig job name in starter
04/15/20 16:15:51 (pid:119772) setting the orig job iwd in starter
04/15/20 16:15:51 (pid:119772) Chirp config summary: IO false, Updates false, Delayed updates true.
04/15/20 16:15:51 (pid:119772) Initialized IO Proxy.
04/15/20 16:15:51 (pid:119772) Done setting resource limits
04/15/20 16:15:51 (pid:119772) Got SIGTERM. Performing graceful shutdown.
04/15/20 16:15:51 (pid:119772) ShutdownGraceful all jobs.
04/15/20 16:15:51 (pid:119772) Failed to open '.update.ad' to read update ad: No such file or directory (2).
04/15/20 16:15:51 (pid:119772) All jobs have exited... starter exiting
04/15/20 16:15:51 (pid:119775) get_file(): Failed to open file /condor/condor/dir_119772/.gahp_complete, errno = 2: No such file or directory.
04/15/20 16:15:51 (pid:119775) get_file(): consumed 0 bytes of file transmission
04/15/20 16:15:51 (pid:119775) DoDownload: consuming rest of transfer and failing after encountering the following error: STARTER at 10.129.5.223
failed to write to file /condor/condor/dir_119772/.gahp_complete: (errno 2) No such file or directory
04/15/20 16:15:51 (pid:119775) get_file(): Failed to open file /condor/condor/dir_119772/glidein_startup.sh, errno = 2: No such file or directory.
04/15/20 16:15:51 (pid:119772) **** condor_starter (condor_STARTER) pid 119772 EXITING WITH STATUS 0
04/15/20 16:15:51 (pid:119772) ERROR "Assertion ERROR on (daemonCore)" at line 4032 in file /var/lib/condor/execute/slot2/dir_13658/userdir/.tmpgZgx4N/BUILD/condor-8.8.8/src/condor_utils/file_transfer.cpp
04/15/20 16:15:51 (pid:119775) get_file(): consumed 69546 bytes of file transmission
04/15/20 16:15:51 (pid:119775) DoDownload: consuming rest of transfer and failing after encountering the following error: STARTER at 10.129.5.223
failed to write to file /condor/condor/dir_119772/glidein_startup.sh: (errno 2) No such file or directory
04/15/20 16:15:51 (pid:119775) get_file(): Failed to open file /condor/condor/dir_119772/condor_exec.exe, errno = 2: No such file or directory.
04/15/20 16:15:51 (pid:119775) get_file(): consumed 13373 bytes of file transmission
04/15/20 16:15:51 (pid:119775) DoDownload: consuming rest of transfer and failing after encountering the following error: STARTER at 10.129.5.223
failed to write to file /condor/condor/dir_119772/condor_exec.exe: (errno 2) No such file or directory
04/15/20 16:15:51 (pid:119775) Failed to set execute bit on /condor/condor/dir_119772/condor_exec.exe, errno=2 (No such file or directory)

04/15/20 16:15:51 config Macros = 142, Sorted = 142, StringBytes = 6011, TablesBytes = 2352
04/15/20 16:15:51 CLASSAD_CACHING is OFF
04/15/20 16:15:51 Daemon Log is logging: D_ALWAYS D_ERROR
04/15/20 16:15:51 SharedPortEndpoint: waiting for connections to named socket 2337399_cbe3_600
04/15/20 16:15:51 DaemonCore: command socket at <137.222.79.6:9618?addrs=137.222.79.6-9618+[2001-630-e4-2810-137-222-79-6]-9618&noUDP&sock=2337399_cbe3_600>
04/15/20 16:15:51 DaemonCore: private command socket at <137.222.79.6:9618?addrs=137.222.79.6-9618+[2001-630-e4-2810-137-222-79-6]-9618&noUDP&sock=2337399_cbe3_600>
04/15/20 16:15:51 Initializing a VANILLA shadow for job 3712168.0
04/15/20 16:15:51 (3712168.0) (2359091): Request to run on slot1_1@xxxxxxxxxxxxxx <10.129.5.223:9618?addrs=10.129.5.223-9618&noUDP&sock=10159_ceb7_3> was ACCEPTED
04/15/20 16:15:51 (3712168.0) (2359091): Job 3712168.0 going into Hold state (code 5,0): The job attribute PeriodicRemove _expression_ 'false || ResidentSetSize > JobMemoryLimit' evaluated to UNDEFINED
04/15/20 16:15:51 (3712168.0) (2359091): Job 3712168.0 going into Hold state (code 5,0): The job attribute PeriodicRemove _expression_ 'false || ResidentSetSize > JobMemoryLimit' evaluated to UNDEFINED
04/15/20 16:15:51 (3712168.0) (2359091): DoUpload: (Condor error code 12, subcode 2) SHADOW at 137.222.79.6 failed to send file(s) to <10.129.5.223:38734>; STARTER at 10.129.5.223 failed to write to file /condor/condor/dir_119772/condor_exec.exe: (errno 2) No such file or directory
04/15/20 16:15:51 (3712168.0) (2359091): File transfer failed (status=0).
04/15/20 16:15:52 (3712168.0) (2359091): Job 3712168.0 is being evicted from slot1_1@xxxxxxxxxxxxxx
04/15/20 16:15:52 (3712168.0) (2359091): logEvictEvent with unknown reason (112), not logging.
04/15/20 16:15:52 (3712168.0) (2359091): **** condor_shadow (condor_SHADOW) pid 2359091 EXITING WITH STATUS 112

The starter log on the execute machine and the shadow log on the submit machine will have more details on the file transfers.

In attempting to reproduce the failure with a local submission, have you tried replicating the exact set of input files to be transferred?

- Jaime

On Apr 15, 2020, at 11:30 AM, Luke Kreczko <L.Kreczko@xxxxxxxxxxxxx> wrote:

Dear HTCondor experts,

After an HTCondor upgrade (8.6 -> 8.8), some jobs have issues transferring files.

Jobs submitted directly to the scheduler (e.g. user jobs) seem to always succeed (or I am unable to replicate the exact behaviour),

but grid jobs don't except for the first file(submission via ARC CE).

Before I ask on the ARC CE mailing list, I was wondering if there as another HTCondor adjustment I could make.

A full StartLog example is given in [1] and the transfer history in [2].

Note that the first file succeeds (the proxy), but consequent files fail.

We tested the combinations of

- ARC CE + HTcondor 8.6, WNs 8.6 --> works

- ARC CE + HTcondor 8.8, WNs 8.6 --> works

- ARC CE + HTcondor 8.6, WNs 8.8 --> fails

- ARC CE + HTcondor 8.8, WNs 8.8 --> fails

So the issue seems to be on the worker node side with HTCondor 8.8 (also tested with latest 8.8.8).

Any help or debugging advice is welcome.

Cheers,

Luke

[1]

/var/log/condor/StartLog

137.222.79.6 --> ARC CE with HTCondor 8.8 schedd [IPv4 + IPv6]

10.129.5.223 --> HTCondor WN (8.8.8) [IPv4 only]

04/15/20 16:15:50 slot1_1: New machine resource of type -1 allocated
04/15/20 16:15:50 Setting up slot pairings
04/15/20 16:15:50 slot1_1: Request accepted.
04/15/20 16:15:50 slot1_1: Remote owner is cmspil041@xxxxxxxxxxxxxx
04/15/20 16:15:50 slot1_1: State change: claiming protocol successful
04/15/20 16:15:50 slot1_1: Changing state: Owner -> Claimed
04/15/20 16:15:50 slot1_1: Got activate_claim request from shadow (137.222.79.6)
04/15/20 16:15:50 slot1_1: Remote job ID is 3712167.0
04/15/20 16:15:50 slot1_1: Got universe "VANILLA" (5) from request classad
04/15/20 16:15:50 slot1_1: State change: claim-activation protocol successful
04/15/20 16:15:50 slot1_1: Changing activity: Idle -> Busy
04/15/20 16:15:50 slot1_1: Called deactivate_claim()
04/15/20 16:15:50 Starter pid 119763 exited with status 4
04/15/20 16:15:50 slot1_1: State change: starter exited
04/15/20 16:15:50 slot1_1: Changing activity: Busy -> Idle
04/15/20 16:15:51 slot1_1: Got activate_claim request from shadow (137.222.79.6)
04/15/20 16:15:51 slot1_1: Remote job ID is 3712168.0
04/15/20 16:15:51 slot1_1: Got universe "VANILLA" (5) from request classad
04/15/20 16:15:51 slot1_1: State change: claim-activation protocol successful
04/15/20 16:15:51 slot1_1: Changing activity: Idle -> Busy
04/15/20 16:15:51 slot1_1: Called deactivate_claim()
04/15/20 16:15:51 slot1_1: Changing state and activity: Claimed/Busy -> Preempting/Vacating
04/15/20 16:15:51 Starter pid 119772 exited with status 4
04/15/20 16:15:51 slot1_1: State change: starter exited
04/15/20 16:15:51 slot1_1: State change: No preempting claim, returning to owner
04/15/20 16:15:51 slot1_1: Changing state and activity: Preempting/Vacating -> Owner/Idle
04/15/20 16:15:51 slot1_1: State change: IS_OWNER is false
04/15/20 16:15:51 slot1_1: Changing state: Owner -> Unclaimed
04/15/20 16:15:51 slot1_1: Changing state: Unclaimed -> Delete
04/15/20 16:15:51 slot1_1: Resource no longer needed, deleting
04/15/20 16:15:52 Error: can't find resource with ClaimId (<10.129.5.223:9618?addrs=10.129.5.223-9618&noUDP&sock=10159_ceb7_3>#1586962562#31#...)
for 444 (ACTIVATE_CLAIM)
04/15/20 16:15:52 Error: can't find resource with ClaimId (<10.129.5.223:9618?addrs=10.129.5.223-9618&noUDP&sock=10159_ceb7_3>#1586962562#31#...)
-- perhaps this claim was already removed?
04/15/20 16:15:52 Error: problem finding resource for 403 (DEACTIVATE_CLAIM)
04/15/20 16:15:52 Error: can't find resource with ClaimId (<10.129.5.223:9618?addrs=10.129.5.223-9618&noUDP&sock=10159_ceb7_3>#1586962562#31#...)
for 443 (RELEASE_CLAIM); perhaps this claim was removed already.
04/15/20 16:15:52 condor_write(): Socket closed when trying to write 29 bytes to <137.222.79.6:25987>, fd is 15
04/15/20 16:15:52 Buf::write(): condor_write() failed

[2]

/var/log/condor/transfer_history
***
JobOwner = "cmspil041"
JobProcId = 0
ConnectionTimeSeconds = 0.0008680820465087891
TransferEndTime = 1586963751.451765
TransferFileBytes = 17019
TransferStartTime = 1586963751.450897
TransferSuccess = true
TransferTotalBytes = 17019
TransferFileName = "job.PkEMDmC3xiwnvnizHq7yZUKmABFKDmABFKDm6ReTDmEBFKDmxHuBCo.proxy"
TransferType = "download"
TransferProtocol = "cedar"
JobClusterId = 3712168

***
JobOwner = "cmspil041"
JobProcId = 0
ConnectionTimeSeconds = 0.0001161098480224609
TransferEndTime = 1586963751.452136
TransferFileBytes = 0
TransferStartTime = 1586963751.45202
TransferSuccess = false
TransferTotalBytes = 0
TransferFileName = ".gahp_complete"
TransferType = "download"
TransferProtocol = "cedar"
JobClusterId = 3712168
***
JobOwner = "cmspil041"
JobProcId = 0
ConnectionTimeSeconds = 0.006474018096923828
TransferEndTime = 1586963751.458725
TransferFileBytes = 69546
TransferStartTime = 1586963751.452251
TransferSuccess = false
TransferTotalBytes = 69546
TransferFileName = "glidein_startup.sh"
TransferType = "download"
TransferProtocol = "cedar"
JobClusterId = 3712168
***
JobOwner = "cmspil041"
JobProcId = 0
ConnectionTimeSeconds = 0.0001227855682373047
TransferEndTime = 1586963751.458953
TransferFileBytes = 13373
TransferStartTime = 1586963751.45883
TransferSuccess = false
TransferTotalBytes = 13373
TransferFileName = "condor_exec.exe"
TransferType = "download"
TransferProtocol = "cedar"
JobClusterId = 3712168
***

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/

Mailing List Archives

Public Access

Re: [HTCondor-users] Jobs failing to transfer files "condor_write(): Socket closed when trying to write"