
[HTCondor-users] (no subject)



Hello Experts,

We are dynamically bringing HTCondor worker nodes up in the cloud to run MPI jobs. While querying the status of the schedd using the Python bindings, the following error is reported multiple times. Sometimes the issue disappears within a few minutes, but often it lingers for a couple of hours. It seems to happen only for MPI jobs.
[xzhou@condor1-submit z2]$ condor_q
-- Failed to fetch ads from: <xx.xxx.128.20:9618?addrs=x.xxx.128.20-9618&noUDP&sock=2486_d180_15> : condor1-submit.c.example.internal
SECMAN:2007:Failed to end classad message.
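
Since the failure is intermittent, one way to make the Python-side query tolerate it is to retry with a growing delay. The following is only an illustrative sketch and not taken from our setup: `query_with_retry` and its parameters are hypothetical names, and it assumes a failed query surfaces as a `RuntimeError` or `OSError` in the bindings.

```python
import time


def query_with_retry(query_fn, attempts=5, delay=2.0, sleep=time.sleep):
    """Call query_fn() and retry when it raises RuntimeError or OSError.

    query_fn is any zero-argument callable that performs the schedd query.
    The wait before retry number n is delay * n seconds (linear back-off).
    The last error is re-raised once all attempts are used up.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return query_fn()
        except (RuntimeError, OSError) as err:  # assumed failure types
            last_error = err
            if attempt < attempts:
                sleep(delay * attempt)
    raise last_error
```

With the bindings installed, the callable could be something like `lambda: htcondor.Schedd().query("true", ["ClusterId", "ProcId", "JobStatus"])`. This only hides short outages; it does not explain why the schedd stops answering.
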
The following error messages are reported in the schedd log:

07/15/20 10:10:04 (pid:475690) attempt to connect to <xx.xxx.136.76:9618> failed: timed out after 10 seconds.
07/15/20 10:10:04 (pid:475690) ERROR: SECMAN:2003:TCP connection to <xx.xxx.136.76:9618> failed.
07/15/20 10:10:04 (pid:475690) condor_write(): Socket closed when trying to write 188 bytes to <xx.xxx.136.76:9618>, fd is 18, errno=107 Transport endpoint is not connected
07/15/20 10:10:04 (pid:475690) Buf::write(): condor_write() failed
07/15/20 10:10:04 (pid:475690) Inserting new attribute Scheduler into non-active cluster cid=23324 acid=-1
07/15/20 10:10:04 (pid:475690) Found 0 potential dedicated resources in 0 seconds
07/15/20 10:10:04 (pid:475690) Skipping job 23324.0 because it requests more nodes (50) than exist in the pool (0)
07/15/20 10:10:04 (pid:475690) attempt to connect to <xx.xxx.128.15:9618> failed: Broken pipe (connect errno = 32).
07/15/20 10:10:04 (pid:475690) condor_write(): Socket closed when trying to write 13 bytes to , fd is 18
07/15/20 10:10:04 (pid:475690) Buf::write(): condor_write() failed
07/15/20 10:10:04 (pid:475690) SharedPortEndpoint: failed to send final status (success) for SHARED_PORT_PASS_SOCK
07/15/20 10:10:04 (pid:475690) condor_write(): Socket closed when trying to write 340 bytes to <xx.xxx.128.20:3301>, fd is 20
07/15/20 10:10:04 (pid:475690) Buf::write(): condor_write() failed
07/15/20 10:10:04 (pid:475690) SECMAN: Error sending response classad to <xx.xxx.128.20:3301>!
NewSession = "YES"
Subsystem = "TOOL"
AuthMethods = "FS,KERBEROS,GSI,CLAIMTOBE"
CryptoMethods = "BLOWFISH,3DES"
Authentication = "OPTIONAL"
Integrity = "OPTIONAL"
Command = 519
Encryption = "OPTIONAL"
ServerPid = 1029234
SessionDuration = "60"
OutgoingNegotiation = "PREFERRED"
Enact = "NO"
SessionLease = 3600
RemoteVersion = "$CondorVersion: 8.8.7 Dec 24 2019 BuildID: 493225 PackageID: 8.8.7-1 $"


07/15/20 10:34:01 (pid:475690) Number of Active Workers 0
07/15/20 10:34:01 (pid:475690) Number of Active Workers 0
07/15/20 10:34:15 (pid:475690) Can't find address for startd condor1-submit.c.example.internal
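
To see which peers the schedd keeps failing to reach, log lines like the ones above can be tallied by address. This is a minimal sketch that assumes the "attempt to connect to <addr> failed" wording shown in the excerpt; `count_connect_failures` is a hypothetical helper name, not part of HTCondor.

```python
import re
from collections import Counter

# Matches schedd log lines such as:
#   ... attempt to connect to <xx.xxx.136.76:9618> failed: timed out after 10 seconds.
CONNECT_FAILURE = re.compile(r"attempt to connect to <([^>]+)> failed")


def count_connect_failures(lines):
    """Return a Counter mapping peer address to number of failed connect attempts."""
    failures = Counter()
    for line in lines:
        match = CONNECT_FAILURE.search(line)
        if match:
            failures[match.group(1)] += 1
    return failures
```

It can be fed an open log file directly, for example the file whose path `condor_config_val SCHEDD_LOG` prints. Addresses that show up repeatedly may point at worker nodes that were torn down while the schedd was still talking to them.
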

Has anyone encountered this issue before? Any input is highly appreciated.

Thanks & Regards,
Vikrant Aggarwal