
Re: [HTCondor-users] Python binding crashes



Hi,

Interesting. I've done large bulk submissions from the Python bindings without a crash, although not on the scale of ten thousand jobs.

Did you increase the debug level of the SchedD as well? That would provide another view of the crash.

Perhaps start with D_FULLDEBUG, D_SECURITY and go from there?
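
As a rough sketch, that would be a config fragment along these lines on the submit host. The file name is only an assumption about a typical packaged install; adjust it to wherever your local config lives.

```
# Hypothetical location: /etc/condor/config.d/99-schedd-debug.conf
SCHEDD_DEBUG = D_FULLDEBUG D_SECURITY
```

After a condor_reconfig, the extra detail should show up in the SchedLog around the timestamps of the failures.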

What auth mechanism are you using? GSI or something else?
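
While narrowing it down, it may also help to know exactly which item fails and whether pacing the submissions changes anything. Below is a minimal sketch, not a confirmed fix: submit_in_batches, batch_size and pause_s are names I made up for illustration, and the idea that throttling helps is only an assumption. Note that the "terminate called" abort in your log happens in the C++ layer, so a Python try/except will not catch that particular failure; this only records errors that surface as ordinary Python exceptions.

```python
import time


def submit_in_batches(items, build_job, submit, batch_size=500, pause_s=1.0):
    """Submit one job per item, pausing between batches.

    build_job and submit are passed in by the caller (for example
    build_job_dict and schedd.submit from the original loop).
    Returns (submitted, failed) so the caller can see which items
    went through and which raised a Python exception.
    """
    submitted, failed = [], []
    for start in range(0, len(items), batch_size):
        for item in items[start:start + batch_size]:
            try:
                # Keep whatever submit() returns alongside the item.
                submitted.append((item, submit(build_job(item))))
            except Exception as exc:
                # Record the failure and carry on with the next item.
                failed.append((item, exc))
        # Give the schedd a breather, except after the final batch.
        if start + batch_size < len(items):
            time.sleep(pause_s)
    return submitted, failed
```

With the names from the quoted code this would be called as submit_in_batches(some_list, build_job_dict, schedd.submit), and the failed list then shows whether the errors cluster at a particular point in the run.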

Thanks,

Iain

> On Feb 2, 2016, at 21:25, Suchindra Sandhu <suchindra@xxxxxxxxx> wrote:
> 
> Hi All,
> 
> I am running into issues when submitting lots of jobs (tens of
> thousands) from the python bindings. 
> 
> The submit code looks like
> 
> schedd = htcondor.Schedd()
> for i in some_list:
>   j = build_job_dict(i)
>   schedd.submit(j)
> 
> 
> Here is the output with debugging turned on. Lines starting with
> "Processing .." are output from my code.
> 
> 
> Tue Feb  2 16:13:58 2016 Processing A
> 02/02/16 16:15:18 condor_read(): timeout reading 5 bytes from
> <10.x.xxx.xxx:12731>.
> 02/02/16 16:15:18 IO: Failed to read packet header
> 02/02/16 16:15:18 SECMAN: no classad from server, failing
> 02/02/16 16:15:18 ERROR: SECMAN:2004:Failed to create security session
> to <10.x.xxx.xxx:12731> with TCP.|SECMAN:2007:Failed to end classad
> message.
> Can't send RESCHEDULE command to schedd.
> Tue Feb  2 16:16:46 2016 Processing B
> 02/02/16 16:18:43 condor_read(): timeout reading 5 bytes from
> <10.x.xxx.xxx:12731>.
> 02/02/16 16:18:43 IO: Failed to read packet header
> 02/02/16 16:18:43 SECMAN: no classad from server, failing
> 02/02/16 16:18:43 ERROR: SECMAN:2004:Failed to create security session
> to <10.x.xxx.xxx:12731> with TCP.|SECMAN:2007:Failed to end classad
> message.
> Can't send RESCHEDULE command to schedd.
> Tue Feb  2 16:20:13 2016 Processing C
> 02/02/16 16:22:10 condor_read(): timeout reading 5 bytes from
> <10.x.xxx.xxx:12731>.
> 02/02/16 16:22:10 IO: Failed to read packet header
> 02/02/16 16:22:10 SECMAN: no classad from server, failing
> 02/02/16 16:22:10 ERROR: SECMAN:2004:Failed to create security session
> to <10.x.xxx.xxx:12731> with TCP.|SECMAN:2007:Failed to end classad
> message.
> Can't send RESCHEDULE command to schedd.
> 02/02/16 16:22:10 condor_write() failed: send() 13 bytes to schedd at
> <10.x.xxx.xxx:12731> returned -1, timeout=0, errno=32 Broken pipe.
> 02/02/16 16:22:10 Buf::write(): condor_write() failed
> terminate called after throwing an instance of
> 'boost::python::error_already_set'
> Aborted
> 
> 
> My initial suspicion was that I was running a lot of jobs which finished
> very fast and thrashed the schedd process. But then I killed all my
> workers and simply tried to queue jobs and got the same error. This is
> not a one-off occurrence and happens pretty deterministically.
> 
> Any idea what is going on?
> 
> 
> Both HTCondor and the Python bindings are version 8.4.3
> 
> Installed Packages
> Name        : condor-python
> Arch        : x86_64
> Version     : 8.4.3
> Release     : 1.el7
> Size        : 4.8 M
> Repo        : installed
> From repo   : htcondor-stable
> Summary     : Python bindings for HTCondor.
> URL         : http://www.cs.wisc.edu/condor/
> License     : ASL 2.0
> Description : The python bindings allow one to directly invoke the C++
>             : implementations of the ClassAd library and HTCondor from
>             : python
> 
> 
> Thanks,
> S
