
Re: [HTCondor-users] Python binding crashes



Thanks! Is D_FULLDEBUG a config variable? 

I am using the default auth mechanism. TRUST_UID_DOMAIN is true. 


> On Feb 2, 2016, at 3:32 PM, Iain Bradford Steers <iain.steers@xxxxxxx> wrote:
> 
> Hi,
> 
> Interesting, I've done large bulk submissions from the Python bindings and not had it crash, although not on the scale of ten thousand jobs.
> 
> Did you increase the debug level of the SchedD as well? That would provide another view of the crash.
> 
> Perhaps start with D_FULLDEBUG, D_SECURITY and go from there?
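> 
> As a rough sketch, assuming the usual <SUBSYS>_DEBUG configuration
> macro and a local config file on the schedd host (the file location
> is an assumption, adjust for your install), that would be something like:
> 
>   # assumed: placed in the schedd host's local HTCondor config
>   SCHEDD_DEBUG = D_FULLDEBUG D_SECURITY
> 
> followed by a condor_reconfig, with the extra output going to the SchedLog.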
> 
> What auth mechanism are you using? GSI or something else?
> 
> Thanks,
> 
> Iain
> 
>> On Feb 2, 2016, at 21:25, Suchindra Sandhu <suchindra@xxxxxxxxx> wrote:
>> 
>> Hi All,
>> 
>> I am running into issues when submitting lots of jobs (tens of
>> thousands) from the python bindings. 
>> 
>> The submit code looks like
>> 
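>> import htcondor
>> 
>> schedd = htcondor.Schedd()    # connect to the local schedd
>> for i in some_list:
>>     j = build_job_dict(i)     # build the job description for item i
>>     schedd.submit(j)          # one submit call per job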
>> 
>> 
>> Here is the output with debugging turned on. Lines starting with
>> "Processing .." are output from my code.
>> 
>> 
>> Tue Feb  2 16:13:58 2016 Processing A
>> 02/02/16 16:15:18 condor_read(): timeout reading 5 bytes from
>> <10.x.xxx.xxx:12731>.
>> 02/02/16 16:15:18 IO: Failed to read packet header
>> 02/02/16 16:15:18 SECMAN: no classad from server, failing
>> 02/02/16 16:15:18 ERROR: SECMAN:2004:Failed to create security session
>> to <10.x.xxx.xxx:12731> with TCP.|SECMAN:2007:Failed to end classad
>> message.
>> Can't send RESCHEDULE command to schedd.
>> Tue Feb  2 16:16:46 2016 Processing B
>> 02/02/16 16:18:43 condor_read(): timeout reading 5 bytes from
>> <10.x.xxx.xxx:12731>.
>> 02/02/16 16:18:43 IO: Failed to read packet header
>> 02/02/16 16:18:43 SECMAN: no classad from server, failing
>> 02/02/16 16:18:43 ERROR: SECMAN:2004:Failed to create security session
>> to <10.x.xxx.xxx:12731> with TCP.|SECMAN:2007:Failed to end classad
>> message.
>> Can't send RESCHEDULE command to schedd.
>> Tue Feb  2 16:20:13 2016 Processing C
>> 02/02/16 16:22:10 condor_read(): timeout reading 5 bytes from
>> <10.x.xxx.xxx:12731>.
>> 02/02/16 16:22:10 IO: Failed to read packet header
>> 02/02/16 16:22:10 SECMAN: no classad from server, failing
>> 02/02/16 16:22:10 ERROR: SECMAN:2004:Failed to create security session
>> to <10.x.xxx.xxx:12731> with TCP.|SECMAN:2007:Failed to end classad
>> message.
>> Can't send RESCHEDULE command to schedd.
>> 02/02/16 16:22:10 condor_write() failed: send() 13 bytes to schedd at
>> <10.x.xxx.xxx:12731> returned -1, timeout=0, errno=32 Broken pipe.
>> 02/02/16 16:22:10 Buf::write(): condor_write() failed
>> terminate called after throwing an instance of
>> 'boost::python::error_already_set'
>> Aborted
>> 
>> 
>> My initial suspicion was that I was running a lot of jobs which finished
>> very fast and thrashed the schedd process. But then I killed all my
>> workers and simply tried to queue jobs and got the same error. This is
>> not a one-off occurrence and happens pretty deterministically.
>> 
>> Any idea what is going on?
>> 
>> 
>> Both HTCondor and the Python bindings are version 8.4.3.
>> 
>> Installed Packages
>> Name        : condor-python
>> Arch        : x86_64
>> Version     : 8.4.3
>> Release     : 1.el7
>> Size        : 4.8 M
>> Repo        : installed
>> From repo   : htcondor-stable
>> Summary     : Python bindings for HTCondor.
>> URL         : http://www.cs.wisc.edu/condor/
>> License     : ASL 2.0
>> Description : The python bindings allow one to directly invoke the C++
>>             : implementations of the ClassAd library and HTCondor from python
>> 
>> 
>> Thanks,
>> S