
[HTCondor-users] Python binding crashes



Hi All,

I am running into issues when submitting lots of jobs (tens of
thousands) from the Python bindings.

The submit code looks like this:

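import htcondor

schedd = htcondor.Schedd()  # connect to the local schedd
for i in some_list:
   # build_job_dict() returns the job description for item i
   j = build_job_dict(i)
   schedd.submit(j)  # one submit call per job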


Here is the output with debugging turned on. Lines starting with
"Processing .." are output from my code.


Tue Feb  2 16:13:58 2016 Processing A
02/02/16 16:15:18 condor_read(): timeout reading 5 bytes from
<10.x.xxx.xxx:12731>.
02/02/16 16:15:18 IO: Failed to read packet header
02/02/16 16:15:18 SECMAN: no classad from server, failing
02/02/16 16:15:18 ERROR: SECMAN:2004:Failed to create security session
to <10.x.xxx.xxx:12731> with TCP.|SECMAN:2007:Failed to end classad
message.
Can't send RESCHEDULE command to schedd.
Tue Feb  2 16:16:46 2016 Processing B
02/02/16 16:18:43 condor_read(): timeout reading 5 bytes from
<10.x.xxx.xxx:12731>.
02/02/16 16:18:43 IO: Failed to read packet header
02/02/16 16:18:43 SECMAN: no classad from server, failing
02/02/16 16:18:43 ERROR: SECMAN:2004:Failed to create security session
to <10.x.xxx.xxx:12731> with TCP.|SECMAN:2007:Failed to end classad
message.
Can't send RESCHEDULE command to schedd.
Tue Feb  2 16:20:13 2016 Processing C
02/02/16 16:22:10 condor_read(): timeout reading 5 bytes from
<10.x.xxx.xxx:12731>.
02/02/16 16:22:10 IO: Failed to read packet header
02/02/16 16:22:10 SECMAN: no classad from server, failing
02/02/16 16:22:10 ERROR: SECMAN:2004:Failed to create security session
to <10.x.xxx.xxx:12731> with TCP.|SECMAN:2007:Failed to end classad
message.
Can't send RESCHEDULE command to schedd.
02/02/16 16:22:10 condor_write() failed: send() 13 bytes to schedd at
<10.x.xxx.xxx:12731> returned -1, timeout=0, errno=32 Broken pipe.
02/02/16 16:22:10 Buf::write(): condor_write() failed
terminate called after throwing an instance of
'boost::python::error_already_set'
Aborted
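
To put a number on how long each submit stalls before the read times out, the log above can be parsed with a short script. This is only a rough sketch: the function name timeout_delays and the regular expressions are my own, and it assumes the HTCondor timestamps are in MM/DD/YY order and that weekday and month names are in English.

```python
import re
from datetime import datetime

# Lines copied from the debug output above (wrapped continuation lines omitted).
SAMPLE = """\
Tue Feb  2 16:13:58 2016 Processing A
02/02/16 16:15:18 condor_read(): timeout reading 5 bytes from
Tue Feb  2 16:16:46 2016 Processing B
02/02/16 16:18:43 condor_read(): timeout reading 5 bytes from
Tue Feb  2 16:20:13 2016 Processing C
02/02/16 16:22:10 condor_read(): timeout reading 5 bytes from
"""

# "Processing" lines come from the submitting script and use a ctime-style timestamp.
PROCESSING = re.compile(r"^(\w{3} \w{3} +\d+ \d\d:\d\d:\d\d \d{4}) Processing (\S+)")
# HTCondor debug lines; MM/DD/YY ordering is an assumption.
TIMEOUT = re.compile(r"^(\d\d/\d\d/\d\d \d\d:\d\d:\d\d) condor_read\(\): timeout")


def timeout_delays(text):
    """Return (label, seconds) from each 'Processing' line to the next read timeout."""
    delays = []
    pending = None  # (label, start time) of the most recent "Processing" line
    for line in text.splitlines():
        m = PROCESSING.match(line)
        if m:
            stamp = " ".join(m.group(1).split())  # collapse the double space before the day
            pending = (m.group(2), datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y"))
            continue
        m = TIMEOUT.match(line)
        if m and pending is not None:
            t = datetime.strptime(m.group(1), "%m/%d/%y %H:%M:%S")
            delays.append((pending[0], int((t - pending[1]).total_seconds())))
            pending = None
    return delays


print(timeout_delays(SAMPLE))
```

For the excerpt above this gives 80, 117 and 117 seconds for A, B and C respectively.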


My initial suspicion was that I was running a lot of jobs which finished
very fast and thrashed the schedd process. But then I killed all my
workers and simply tried to queue jobs, and got the same error. This is
not a one-off occurrence; it happens pretty deterministically.

Any idea what is going on?


Both HTCondor and the Python bindings are version 8.4.3.

Installed Packages
Name        : condor-python
Arch        : x86_64
Version     : 8.4.3
Release     : 1.el7
Size        : 4.8 M
Repo        : installed
From repo   : htcondor-stable
Summary     : Python bindings for HTCondor.
URL         : http://www.cs.wisc.edu/condor/
License     : ASL 2.0
Description : The python bindings allow one to directly invoke the C++ implementations of
            : the ClassAd library and HTCondor from python


Thanks,
S