
Re: [HTCondor-users] Python binding crashes



Ah, apologies, I should have been more specific.

D_FULLDEBUG is a debug level rather than a config variable of its own. Can you set the following value in your condor config and then issue condor_reconfig?

SCHEDD_DEBUG = D_FULLDEBUG, D_SECURITY

Also, what's the output of:

]$ condor_config_val -v SEC_DEFAULT_AUTHENTICATION_METHODS

and

]$ condor_config_val -v SEC_WRITE_AUTHENTICATION_METHODS
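If it is easier, here is a rough Python sketch that gathers both values in one go so the output can be pasted into a reply. It only shells out to the same condor_config_val -v commands shown above; the helper names (build_command, collect) and the optional run hook are made up for illustration and are not part of HTCondor.

```python
import subprocess

# Knob names taken from the two commands above.
KNOBS = [
    "SEC_DEFAULT_AUTHENTICATION_METHODS",
    "SEC_WRITE_AUTHENTICATION_METHODS",
]


def build_command(knob):
    """Return the argv for one verbose condor_config_val lookup."""
    return ["condor_config_val", "-v", knob]


def _run_real(argv):
    """Run the command and return stdout and stderr as one string."""
    proc = subprocess.run(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return proc.stdout


def collect(knobs, run=None):
    """Return {knob: output text} for each knob.

    `run` is a hypothetical hook so a fake runner can be substituted
    on a machine without HTCondor installed; by default the real
    condor_config_val binary is invoked.
    """
    if run is None:
        run = _run_real
    return {knob: run(build_command(knob)).strip() for knob in knobs}
```

Calling collect(KNOBS) on the submit host and printing each entry should give the same text as running the two commands by hand.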

Thanks,

Iain


> On Feb 2, 2016, at 21:44, Suchindra Sandhu <suchindra@xxxxxxxxx> wrote:
> 
> Thanks! Is D_FULLDEBUG a config variable? 
> 
> I am using the default auth mechanism. TRUST_UID_DOMAIN is true. 
> 
> 
>> On Feb 2, 2016, at 3:32 PM, Iain Bradford Steers <iain.steers@xxxxxxx> wrote:
>> 
>> Hi,
>> 
>> Interesting, I've done large bulk submissions from the python bindings and not had them crash, although not on the scale of ten thousand jobs.
>> 
>> Did you increase the debug level of the SchedD as well? That would provide another view of the crash.
>> 
>> Perhaps start with D_FULLDEBUG, D_SECURITY and go from there?
>> 
>> What auth mechanism are you using? GSI or something else?
>> 
>> Thanks,
>> 
>> Iain
>> 
>>> On Feb 2, 2016, at 21:25, Suchindra Sandhu <suchindra@xxxxxxxxx> wrote:
>>> 
>>> Hi All,
>>> 
>>> I am running into issues when submitting lots of jobs (tens of
>>> thousands) from the python bindings. 
>>> 
>>> The submit code looks like
>>> 
>>> import htcondor
>>> 
>>> schedd = htcondor.Schedd()
>>> for i in some_list:
>>>     j = build_job_dict(i)  # build the job description for this item
>>>     schedd.submit(j)       # one submit call per job
>>> 
>>> 
>>> Here is the output with debugging turned on. Lines starting with
>>> "Processing .." are output from my code.
>>> 
>>> 
>>> Tue Feb  2 16:13:58 2016 Processing A
>>> 02/02/16 16:15:18 condor_read(): timeout reading 5 bytes from
>>> <10.x.xxx.xxx:12731>.
>>> 02/02/16 16:15:18 IO: Failed to read packet header
>>> 02/02/16 16:15:18 SECMAN: no classad from server, failing
>>> 02/02/16 16:15:18 ERROR: SECMAN:2004:Failed to create security session
>>> to <10.x.xxx.xxx:12731> with TCP.|SECMAN:2007:Failed to end classad
>>> message.
>>> Can't send RESCHEDULE command to schedd.
>>> Tue Feb  2 16:16:46 2016 Processing B
>>> 02/02/16 16:18:43 condor_read(): timeout reading 5 bytes from
>>> <10.x.xxx.xxx:12731>.
>>> 02/02/16 16:18:43 IO: Failed to read packet header
>>> 02/02/16 16:18:43 SECMAN: no classad from server, failing
>>> 02/02/16 16:18:43 ERROR: SECMAN:2004:Failed to create security session
>>> to <10.x.xxx.xxx:12731> with TCP.|SECMAN:2007:Failed to end classad
>>> message.
>>> Can't send RESCHEDULE command to schedd.
>>> Tue Feb  2 16:20:13 2016 Processing C
>>> 02/02/16 16:22:10 condor_read(): timeout reading 5 bytes from
>>> <10.x.xxx.xxx:12731>.
>>> 02/02/16 16:22:10 IO: Failed to read packet header
>>> 02/02/16 16:22:10 SECMAN: no classad from server, failing
>>> 02/02/16 16:22:10 ERROR: SECMAN:2004:Failed to create security session
>>> to <10.x.xxx.xxx:12731> with TCP.|SECMAN:2007:Failed to end classad
>>> message.
>>> Can't send RESCHEDULE command to schedd.
>>> 02/02/16 16:22:10 condor_write() failed: send() 13 bytes to schedd at
>>> <10.x.xxx.xxx:12731> returned -1, timeout=0, errno=32 Broken pipe.
>>> 02/02/16 16:22:10 Buf::write(): condor_write() failed
>>> terminate called after throwing an instance of
>>> 'boost::python::error_already_set'
>>> Aborted
>>> 
>>> 
>>> My initial suspicion was that I was running a lot of jobs which finished
>>> very fast and thrashed the schedd process. But then I killed all my
>>> workers and simply tried to queue jobs and got the same error. This is
>>> not a one-off occurrence and happens pretty deterministically.
>>> 
>>> Any idea what is going on?
>>> 
>>> 
>>> Both HTCondor and the python bindings are version 8.4.3.
>>> 
>>> Installed Packages
>>> Name        : condor-python
>>> Arch        : x86_64
>>> Version     : 8.4.3
>>> Release     : 1.el7
>>> Size        : 4.8 M
>>> Repo        : installed
>>> From repo   : htcondor-stable
>>> Summary     : Python bindings for HTCondor.
>>> URL         : http://www.cs.wisc.edu/condor/
>>> License     : ASL 2.0
>>> Description : The python bindings allow one to directly invoke the C++
>>> implementations of
>>>          : the ClassAd library and HTCondor from python
>>> 
>>> 
>>> Thanks,
>>> S
>>> _______________________________________________
>>> HTCondor-users mailing list
>>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
>>> subject: Unsubscribe
>>> You can also unsubscribe by visiting
>>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>>> 
>>> The archives can be found at:
>>> https://lists.cs.wisc.edu/archive/htcondor-users/
>> 
> 
