Re: [HTCondor-users] Python binding crashes



It seems the schedd failed to send a response back to my submit client
(Python) because the TCP connection between them broke. Is it because I
am sending too many submit requests?


02/02/16 23:59:09 condor_write(): Socket closed when trying to write 310 bytes to <10.x.xxx.xxx:57170>, fd is 15
02/02/16 23:59:09 Buf::write(): condor_write() failed
02/02/16 23:59:09 SECMAN: Error sending response classad to <10.x.xxx.xxx:57170>!
SessionDuration = "60"
AuthMethods = "FS,KERBEROS,GSI"
Command = 516
RemoteVersion = "$CondorVersion: 8.4.3 Dec 15 2015 BuildID: 352143 $"
SessionLease = 3600
OutgoingNegotiation = "PREFERRED"
NewSession = "YES"
CryptoMethods = "3DES,BLOWFISH"
Authentication = "OPTIONAL"
Enact = "NO"
Subsystem = "TOOL"
Encryption = "OPTIONAL"
ServerPid = 2476702
Integrity = "OPTIONAL"
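
To test whether request volume is the cause, one option is to space the
submissions out and retry the ones that fail. Below is a minimal sketch
of that idea, not a confirmed fix. The helper name submit_throttled and
its parameters (max_attempts, base_delay, pause) are hypothetical, and
it assumes a failed submit surfaces in Python as a RuntimeError. It
cannot catch the "terminate called ... Aborted" case from the earlier
log, since that ends the process at the C++ level.

```python
import time


def submit_throttled(submit_fn, job_ads, max_attempts=3, base_delay=1.0,
                     pause=0.0, sleep=time.sleep):
    """Call submit_fn(ad) for each ad, retrying failures with backoff.

    submit_fn is any callable taking one job ad (for example a bound
    schedd.submit). Assumption: failures raise RuntimeError.
    """
    results = []
    for ad in job_ads:
        for attempt in range(1, max_attempts + 1):
            try:
                results.append(submit_fn(ad))
                break
            except RuntimeError:
                if attempt == max_attempts:
                    raise  # give up on this ad after the last attempt
                # Exponential backoff: base_delay, 2*base_delay, ...
                sleep(base_delay * 2 ** (attempt - 1))
        if pause:
            sleep(pause)  # fixed gap between consecutive submissions
    return results
```

With the loop from the original report this might be called as
submit_throttled(schedd.submit, (build_job_dict(i) for i in some_list),
pause=0.05). If the errors persist even with a generous pause, request
volume is probably not the whole story.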


On Tue, Feb 2, 2016, at 05:21 PM, Suchindra Sandhu wrote:
> Will run with the debug flags and see. Meanwhile I don't have any auth
> mechanisms defined.
> 
> $ condor_config_val -dump | grep SEC_
> 
> SEC_CLAIMTOBE_INCLUDE_DOMAIN = false
> SEC_CLAIMTOBE_USER =
> SEC_DEBUG_PRINT_KEYS = false
> SEC_DEFAULT_AUTHENTICATION_TIMEOUT = 20
> SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION = true
> SEC_INVALIDATE_SESSIONS_VIA_TCP = true
> SEC_PASSWORD_DOMAIN =
> SEC_PASSWORD_FILE =
> SEC_SESSION_DURATION_SLOP = 20
> SEC_TCP_SESSION_TIMEOUT = 20
> 
> 
> 
> On Tue, Feb 2, 2016, at 03:51 PM, Iain Bradford Steers wrote:
> > Ah, apologies, I should have been more specific.
> > 
> > Can you set a new config value in your condor config and then issue
> > condor_reconfig.
> > 
> > SCHEDD_DEBUG = D_FULLDEBUG, D_SECURITY
> > 
> > Also what's the output of:
> > 
> > ]$ condor_config_val -v SEC_DEFAULT_AUTHENTICATION_METHODS
> > 
> > and
> > 
> > ]$ condor_config_val -v SEC_WRITE_AUTHENTICATION_METHODS
> > 
> > Thanks,
> > 
> > Iain
> > 
> > 
> > > On Feb 2, 2016, at 21:44, Suchindra Sandhu <suchindra@xxxxxxxxx> wrote:
> > > 
> > > Thanks! Is D_FULLDEBUG a config variable? 
> > > 
> > > I am using the default auth mechanism. TRUST_UID_DOMAIN is true. 
> > > 
> > > 
> > >> On Feb 2, 2016, at 3:32 PM, Iain Bradford Steers <iain.steers@xxxxxxx> wrote:
> > >> 
> > >> Hi,
> > >> 
> > >> Interesting, I've done large bulk submissions from the Python bindings and not had it crash, although not on the scale of ten thousand jobs.
> > >> 
> > >> Did you increase the debug level of the SchedD as well? That would provide another view of the crash.
> > >> 
> > >> Perhaps start with D_FULLDEBUG, D_SECURITY and go from there?
> > >> 
> > >> What auth mechanism are you using? GSI or something else?
> > >> 
> > >> Thanks,
> > >> 
> > >> Iain
> > >> 
> > >>> On Feb 2, 2016, at 21:25, Suchindra Sandhu <suchindra@xxxxxxxxx> wrote:
> > >>> 
> > >>> Hi All,
> > >>> 
> > >>> I am running into issues when submitting lots of jobs (tens of
> > >>> thousands) from the python bindings. 
> > >>> 
> > >>> The submit code looks like
> > >>> 
> > >>> import htcondor
> > >>> schedd = htcondor.Schedd()
> > >>> for i in some_list:
> > >>>     j = build_job_dict(i)
> > >>>     schedd.submit(j)
> > >>> 
> > >>> 
> > >>> Here is the output with debugging turned on. Lines starting with
> > >>> "Processing .." are output from my code.
> > >>> 
> > >>> 
> > >>> Tue Feb  2 16:13:58 2016 Processing A
> > >>> 02/02/16 16:15:18 condor_read(): timeout reading 5 bytes from
> > >>> <10.x.xxx.xxx:12731>.
> > >>> 02/02/16 16:15:18 IO: Failed to read packet header
> > >>> 02/02/16 16:15:18 SECMAN: no classad from server, failing
> > >>> 02/02/16 16:15:18 ERROR: SECMAN:2004:Failed to create security session
> > >>> to <10.x.xxx.xxx:12731> with TCP.|SECMAN:2007:Failed to end classad
> > >>> message.
> > >>> Can't send RESCHEDULE command to schedd.
> > >>> Tue Feb  2 16:16:46 2016 Processing B
> > >>> 02/02/16 16:18:43 condor_read(): timeout reading 5 bytes from
> > >>> <10.x.xxx.xxx:12731>.
> > >>> 02/02/16 16:18:43 IO: Failed to read packet header
> > >>> 02/02/16 16:18:43 SECMAN: no classad from server, failing
> > >>> 02/02/16 16:18:43 ERROR: SECMAN:2004:Failed to create security session
> > >>> to <10.x.xxx.xxx:12731> with TCP.|SECMAN:2007:Failed to end classad
> > >>> message.
> > >>> Can't send RESCHEDULE command to schedd.
> > >>> Tue Feb  2 16:20:13 2016 Processing C
> > >>> 02/02/16 16:22:10 condor_read(): timeout reading 5 bytes from
> > >>> <10.x.xxx.xxx:12731>.
> > >>> 02/02/16 16:22:10 IO: Failed to read packet header
> > >>> 02/02/16 16:22:10 SECMAN: no classad from server, failing
> > >>> 02/02/16 16:22:10 ERROR: SECMAN:2004:Failed to create security session
> > >>> to <10.x.xxx.xxx:12731> with TCP.|SECMAN:2007:Failed to end classad
> > >>> message.
> > >>> Can't send RESCHEDULE command to schedd.
> > >>> 02/02/16 16:22:10 condor_write() failed: send() 13 bytes to schedd at
> > >>> <10.x.xxx.xxx:12731> returned -1, timeout=0, errno=32 Broken pipe.
> > >>> 02/02/16 16:22:10 Buf::write(): condor_write() failed
> > >>> terminate called after throwing an instance of
> > >>> 'boost::python::error_already_set'
> > >>> Aborted
> > >>> 
> > >>> 
> > >>> My initial suspicion was that I was running a lot of jobs which finished
> > >>> very fast and thrashed the schedd process. But then I killed all my
> > >>> workers and simply tried to queue jobs and got the same error. This is
> > >>> not a one-off occurrence and happens pretty deterministically.
> > >>> 
> > >>> Any idea what is going on?
> > >>> 
> > >>> 
> > >>> Both HTCondor and the Python bindings are version 8.4.3.
> > >>> 
> > >>> Installed Packages
> > >>> Name        : condor-python
> > >>> Arch        : x86_64
> > >>> Version     : 8.4.3
> > >>> Release     : 1.el7
> > >>> Size        : 4.8 M
> > >>> Repo        : installed
> > >>> From repo   : htcondor-stable
> > >>> Summary     : Python bindings for HTCondor.
> > >>> URL         : http://www.cs.wisc.edu/condor/
> > >>> License     : ASL 2.0
> > >>> Description : The python bindings allow one to directly invoke the C++
> > >>> implementations of
> > >>>          : the ClassAd library and HTCondor from python
> > >>> 
> > >>> 
> > >>> Thanks,
> > >>> S
> > >>> _______________________________________________
> > >>> HTCondor-users mailing list
> > >>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> > >>> subject: Unsubscribe
> > >>> You can also unsubscribe by visiting
> > >>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> > >>> 
> > >>> The archives can be found at:
> > >>> https://lists.cs.wisc.edu/archive/htcondor-users/