Re: [HTCondor-users] reporting bugs in condor-python bindings



Hi Xin,

Todd and I chatted on the phone a bit today to brainstorm potential causes of the issue you are seeing. In particular, other users have not reported similar issues, so we were trying to work out what might be unique about your setup.

My best guess is a bad interaction between the multiprocessing module (I saw it loaded in your first traceback) and the htcondor module. Two questions:

1.  Do some calls to the htcondor module occur in the parent process and some in the child process?  If so, does the issue go away if you only use the htcondor module from the child process?  Are you using any multithreading in either process?
2.  Is it possible to share the code (or a simplified version of it)?  Even if you can't post it publicly, just sharing it with the condor-admins list would help us reproduce the issue.
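
To make question 1 concrete, here is a minimal sketch of what "only use the htcondor module from the child process" could look like. It is an illustration under assumptions, not a known fix: run_in_child and count_jobs are hypothetical names, the constraint string is made up, and the explicit "fork" context assumes Python 3.4+ on a POSIX platform (on python 2.6 you would call multiprocessing.Pool directly).

```python
import multiprocessing


def run_in_child(func, *args):
    """Run func(*args) in a single child process and return its result."""
    # Assumption: a POSIX platform where the "fork" start method exists
    # (multiprocessing.get_context requires Python 3.4+).
    ctx = multiprocessing.get_context("fork")
    pool = ctx.Pool(processes=1)
    try:
        # apply() blocks until the child has finished and returns its result.
        return pool.apply(func, args)
    finally:
        pool.close()
        pool.join()


def count_jobs(constraint):
    """Hypothetical worker: every htcondor call happens in the child."""
    # Importing here, rather than at module level, means the parent
    # process never loads the htcondor module at all.
    import htcondor
    schedd = htcondor.Schedd()
    return len(schedd.query(constraint))
```

The parent would then reach the bindings only through something like run_in_child(count_jobs, "JobStatus == 2"). If the hangs and crashes disappear with that arrangement, it points toward the parent/child interaction.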

Additionally, if you can run valgrind against the python process, the output would be immensely useful for debugging.

Thanks!  Sorry we don't have a clear fix, but this will help us discover what's wrong.

Brian

> On Oct 18, 2017, at 8:27 AM, Xin Wang <xwang@xxxxxxxxxxxxx> wrote:
> 
> I did another two runs using python 2.6.
> Both runs hung instead of crashing, and they hung within 10 minutes of submitting tasks to condor through schedd.submitMany.
> 
> One hung at the following:
> condorserver_2.6.py(208):                     params = list()
> condorserver_2.6.py(209):                     for entry in batch:
> 
> 
> -----Original Message-----
> From: Xin Wang
> Sent: Tuesday, October 17, 2017 2:31 PM
> To: Todd Tannenbaum <tannenba@xxxxxxxxxxx>; HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>; 'htcondor-admin@xxxxxxxxxxx' <htcondor-admin@xxxxxxxxxxx>
> Cc: Andrew Georgiev <AGeorgiev@xxxxxxxxxxxxx>
> Subject: RE: [HTCondor-users] reporting bugs in condor-python bindings
> 
> Hi, Todd,
> 
> When using the bindings in the RPM with python 2.6, I was using HTCondor 8.6.6.
> 
> I do not recall exactly how long that particular run with python 2.6 lasted before it crashed, but it definitely worked properly for at least a few minutes; it did not crash right at the start.
> 
> I can probably set up another run in python 2.6 and let it run again later today or tomorrow and can report back.
> 
> Let me know if you need more details.
> 
> Thank you.
> 
> Xin
> 
> -----Original Message-----
> From: Todd Tannenbaum [mailto:tannenba@xxxxxxxxxxx]
> Sent: Tuesday, October 17, 2017 2:00 PM
> To: Xin Wang <xwang@xxxxxxxxxxxxx>; HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>; 'htcondor-admin@xxxxxxxxxxx' <htcondor-admin@xxxxxxxxxxx>
> Cc: Andrew Georgiev <ageorgiev@xxxxxxxxxxxxx>
> Subject: Re: [HTCondor-users] reporting bugs in condor-python bindings
> 
> Hi Xin,
> 
> Thank you for the clarifications.
> 
> 
>> I guess I was not clear enough in my previous email, so I would like to clarify. When I tried the htcondor python bindings installed by RPM, I was using python 2.6. The daemon crashed at the point where a new empty list was being created in python, and the error message and the full stack trace were posted in my previous email.
> 
> Ignoring the Python 3 attempts for a bit and focusing just on Python 2...
> 
> So when you ran using the bindings in the RPM with Python 2.6 and had a
> crash, were you also using HTCondor v8.6.6 or some other version? I ask
> because in HTCondor v8.6.6 we upgraded the version of Boost.Python used
> in HTCondor.  Did your Python 2 attempts also run successfully for a few
> (minutes/hours/days?) before the crash?  Are you able to easily
> artificially reproduce the crash perhaps by having your daemon hit the
> python bindings at a high frequency (i.e. remove any sleep or event
> waits in your daemon so it continuously hits the bindings) ?
> 
> One last thought is we did clean up a few other misc things in HTCondor
> v8.7 while doing the work to enable Python 3... I wonder if the same
> problems still persist if you tried running your daemon with HTCondor
> v8.7.3+ instead of HTCondor v8.6....
> 
> While we do have a whole bunch of regression tests for our HTCondor
> Python bindings, it looks like we need to add some long running
> testing... (and/or figure out how to use tools like Coverity/Valgrind in
> the Boost.Python environment... shudder...)
> 
> Thanks again Xin for reporting all this and your help,
> regards,
> Todd
> 
> 