Re: [HTCondor-users] Python binding crashes



Will run with the debug flags and see. Meanwhile I don't have any auth
mechanisms defined.

$ condor_config_val -dump | grep SEC_

SEC_CLAIMTOBE_INCLUDE_DOMAIN = false
SEC_CLAIMTOBE_USER =
SEC_DEBUG_PRINT_KEYS = false
SEC_DEFAULT_AUTHENTICATION_TIMEOUT = 20
SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION = true
SEC_INVALIDATE_SESSIONS_VIA_TCP = true
SEC_PASSWORD_DOMAIN =
SEC_PASSWORD_FILE =
SEC_SESSION_DURATION_SLOP = 20
SEC_TCP_SESSION_TIMEOUT = 20
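
For what it's worth, the same check can be scripted. This is only a minimal sketch, assuming the dump is plain `NAME = value` lines like the ones above. The function names are made up for illustration and are not part of HTCondor or its python bindings.

```python
def parse_config_dump(text):
    """Parse `NAME = value` lines into a dict; unset values become ''."""
    settings = {}
    for line in text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # skip comment lines
        name, sep, value = line.partition("=")
        if not sep:
            continue  # skip blank lines and anything without '='
        settings[name.strip()] = value.strip()
    return settings


def auth_method_settings(settings):
    """Return the non-empty SEC_*_AUTHENTICATION_METHODS entries (hypothetical helper)."""
    return {
        name: value
        for name, value in settings.items()
        if name.startswith("SEC_")
        and name.endswith("_AUTHENTICATION_METHODS")
        and value
    }


# A few lines copied from the dump above, used as sample input.
SAMPLE_DUMP = """\
SEC_CLAIMTOBE_INCLUDE_DOMAIN = false
SEC_CLAIMTOBE_USER =
SEC_DEFAULT_AUTHENTICATION_TIMEOUT = 20
SEC_PASSWORD_FILE =
"""

sample = parse_config_dump(SAMPLE_DUMP)
explicit_methods = auth_method_settings(sample)
```

With the sample above, `explicit_methods` comes back empty, which matches what the grep shows: no authentication methods are set explicitly.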



On Tue, Feb 2, 2016, at 03:51 PM, Iain Bradford Steers wrote:
> Ah, apologies, I should have been more specific.
> 
> Can you set a new config value in your condor config and then issue
> condor_reconfig?
> 
> SCHEDD_DEBUG = D_FULLDEBUG, D_SECURITY
> 
> Also what's the output of:
> 
> ]$ condor_config_val -v SEC_DEFAULT_AUTHENTICATION_METHODS
> 
> and
> 
> ]$ condor_config_val -v SEC_WRITE_AUTHENTICATION_METHODS
> 
> Thanks,
> 
> Iain
> 
> 
> > On Feb 2, 2016, at 21:44, Suchindra Sandhu <suchindra@xxxxxxxxx> wrote:
> > 
> > Thanks! Is D_FULLDEBUG a config variable? 
> > 
> > I am using the default auth mechanism. TRUST_UID_DOMAIN is true. 
> > 
> > 
> >> On Feb 2, 2016, at 3:32 PM, Iain Bradford Steers <iain.steers@xxxxxxx> wrote:
> >> 
> >> Hi,
> >> 
> >> Interesting, I've done large bulk submission from python bindings and not had it crash, although not on the scale of ten thousand jobs.
> >> 
> >> Did you increase the debug level of the SchedD as well? That would provide another view of the crash.
> >> 
> >> Perhaps start with D_FULLDEBUG, D_SECURITY and go from there?
> >> 
> >> What auth mechanism are you using? GSI or something else?
> >> 
> >> Thanks,
> >> 
> >> Iain
> >> 
> >>> On Feb 2, 2016, at 21:25, Suchindra Sandhu <suchindra@xxxxxxxxx> wrote:
> >>> 
> >>> Hi All,
> >>> 
> >>> I am running into issues when submitting lots of jobs (tens of
> >>> thousands) from the python bindings. 
> >>> 
> >>> The submit code looks like
> >>> 
> >>> schedd = htcondor.Schedd()
> >>> for i in some_list:
> >>>     j = build_job_dict(i)
> >>>     schedd.submit(j)
> >>> 
> >>> 
> >>> Here is the output with debugging turned on. Lines starting with
> >>> "Processing .." are output from my code.
> >>> 
> >>> 
> >>> Tue Feb  2 16:13:58 2016 Processing A
> >>> 02/02/16 16:15:18 condor_read(): timeout reading 5 bytes from
> >>> <10.x.xxx.xxx:12731>.
> >>> 02/02/16 16:15:18 IO: Failed to read packet header
> >>> 02/02/16 16:15:18 SECMAN: no classad from server, failing
> >>> 02/02/16 16:15:18 ERROR: SECMAN:2004:Failed to create security session
> >>> to <10.x.xxx.xxx:12731> with TCP.|SECMAN:2007:Failed to end classad
> >>> message.
> >>> Can't send RESCHEDULE command to schedd.
> >>> Tue Feb  2 16:16:46 2016 Processing B
> >>> 02/02/16 16:18:43 condor_read(): timeout reading 5 bytes from
> >>> <10.x.xxx.xxx:12731>.
> >>> 02/02/16 16:18:43 IO: Failed to read packet header
> >>> 02/02/16 16:18:43 SECMAN: no classad from server, failing
> >>> 02/02/16 16:18:43 ERROR: SECMAN:2004:Failed to create security session
> >>> to <10.x.xxx.xxx:12731> with TCP.|SECMAN:2007:Failed to end classad
> >>> message.
> >>> Can't send RESCHEDULE command to schedd.
> >>> Tue Feb  2 16:20:13 2016 Processing C
> >>> 02/02/16 16:22:10 condor_read(): timeout reading 5 bytes from
> >>> <10.x.xxx.xxx:12731>.
> >>> 02/02/16 16:22:10 IO: Failed to read packet header
> >>> 02/02/16 16:22:10 SECMAN: no classad from server, failing
> >>> 02/02/16 16:22:10 ERROR: SECMAN:2004:Failed to create security session
> >>> to <10.x.xxx.xxx:12731> with TCP.|SECMAN:2007:Failed to end classad
> >>> message.
> >>> Can't send RESCHEDULE command to schedd.
> >>> 02/02/16 16:22:10 condor_write() failed: send() 13 bytes to schedd at
> >>> <10.x.xxx.xxx:12731> returned -1, timeout=0, errno=32 Broken pipe.
> >>> 02/02/16 16:22:10 Buf::write(): condor_write() failed
> >>> terminate called after throwing an instance of
> >>> 'boost::python::error_already_set'
> >>> Aborted
> >>> 
> >>> 
> >>> My initial suspicion was that I was running a lot of jobs which finished
> >>> very fast and thrashed the schedd process. But then I killed all my
> >>> workers and simply tried to queue jobs and got the same error. This is
> >>> not a one-off occurrence and happens pretty deterministically.
> >>> 
> >>> Any idea what is going on?
> >>> 
> >>> 
> >>> Both htcondor and python bindings are for 8.4.3
> >>> 
> >>> Installed Packages
> >>> Name        : condor-python
> >>> Arch        : x86_64
> >>> Version     : 8.4.3
> >>> Release     : 1.el7
> >>> Size        : 4.8 M
> >>> Repo        : installed
> >>> From repo   : htcondor-stable
> >>> Summary     : Python bindings for HTCondor.
> >>> URL         : http://www.cs.wisc.edu/condor/
> >>> License     : ASL 2.0
> >>> Description : The python bindings allow one to directly invoke the C++
> >>>             : implementations of the ClassAd library and HTCondor from python
> >>> 
> >>> 
> >>> Thanks,
> >>> S
> >>> _______________________________________________
> >>> HTCondor-users mailing list
> >>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> >>> subject: Unsubscribe
> >>> You can also unsubscribe by visiting
> >>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> >>> 
> >>> The archives can be found at:
> >>> https://lists.cs.wisc.edu/archive/htcondor-users/