Re: [HTCondor-users] have people seen scalability issues with condor submission using the python bindings?



Thanks a lot for your comments, and thanks to Bob as well.

I cannot really submit everything at once, or at least not easily: each
batch of jobs is different from the others, and every thread runs on its
own timing, so it is not a good idea for them to wait for each other.
So I guess I can try to serialize the submissions. That would slow things
down a little, as you point out, but nothing dramatic.
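
For what it is worth, here is a minimal sketch of the serialization I have
in mind, assuming one lock shared by all the threads that use the same
Schedd. The names submit_lock, submit_serialized and do_submit are made up
for illustration; do_submit stands for whatever each thread currently runs
inside "with self.schedd.transaction() as txn:".

```python
import threading

# Hypothetical: one lock shared by every thread that submits to the same schedd.
submit_lock = threading.Lock()


def submit_serialized(do_submit, *args, **kwargs):
    """Call do_submit while holding the shared lock.

    do_submit is a placeholder for the thread's existing submission code,
    e.g. the function that opens schedd.transaction() and queues its jobs.
    """
    with submit_lock:
        return do_submit(*args, **kwargs)


if __name__ == "__main__":
    # Stand-in for a real submission, so the sketch runs without a schedd.
    def fake_submit(batch_name):
        return "submitted " + batch_name

    threads = [
        threading.Thread(target=submit_serialized,
                         args=(fake_submit, "batch%d" % i))
        for i in range(4)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Only one thread at a time gets past the lock, so the others simply block
instead of timing out against the schedd on their own.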

Thanks a lot.
Jose
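
P.S. For the archive, this is how I read John's suggestion (quoted below)
about passing a count to Submit.queue(). It is a rough sketch only, not
tested against a real schedd. submit_batch is a made-up helper; schedd is
assumed to be an htcondor.Schedd, sub an htcondor.Submit, and I am assuming
the queue(txn, count) form of the call.

```python
def submit_batch(schedd, sub, count):
    """Queue `count` jobs from one submit description in a single transaction.

    schedd: an htcondor.Schedd instance (assumed)
    sub:    an htcondor.Submit instance describing the jobs (assumed)
    Returns whatever Submit.queue() returns.
    """
    with schedd.transaction() as txn:
        # One call with a count, instead of calling queue() `count` times.
        return sub.queue(txn, count)
```

Each thread would call this once per batch, so the schedd sees one
submission per batch rather than one per job.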


2018-08-21 12:59 GMT-04:00 John M Knoeller <johnkn@xxxxxxxxxxx>:
> The HTCondor schedd will only process one submit command at a time, but one submit command can submit thousands, or even tens of thousands, of jobs. So the problem is not too many jobs, but too many threads, each with its own timeout in trying to contact the HTCondor schedd.
>
> If the threads just waited their turn before trying to contact the schedd, then this would work - but of course if you do that, you might as well be using a single thread to begin with. A better way is to submit multiple jobs with a single call into the bindings. This will be a much lower burden on the schedd. Calling the Submit.queue() method with an argument of 10 is MUCH lower overhead than calling Submit.queue() 10 times.
>
> If you can upgrade to 8.7, you should do that. The 8.7 bindings have a new method queue_with_itemdata() which takes a Python iterator as an argument, and will submit one or more jobs for each iteration, once again with MUCH lower overhead per job than calling queue() many times.
>
> -tj
>
> -----Original Message-----
> From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Bob Ball
> Sent: Tuesday, August 21, 2018 11:01 AM
> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>; Jose Caballero <jcaballero.hep@xxxxxxxxx>; Condor-Users Mail List <condor-users@xxxxxxxxxxx>
> Subject: Re: [HTCondor-users] have people seen scalability issues with condor submission using the python bindings?
>
> I have not tried python, and not that version of Condor, but I have seen
> cases in the past where too many jobs submitted at once overwhelmed
> Condor job submission.
>
> bob
>
> On 8/21/2018 11:38 AM, Jose Caballero wrote:
>> 2018-08-21 11:28 GMT-04:00 Jose Caballero <jcaballero.hep@xxxxxxxxx>:
>>> Hi,
>>>
>>> I am seeing what I believe are scaling problems when submitting jobs
>>> with the python bindings.
>>> The HTCondor version is 8.6.12.
>>> My application has multiple threads, and when they all try to submit
>>> at almost the same time, using the same Schedd, around 40% of them
>>> succeed and ~60% fail.
>>> I know I should write the code in a smarter way, maybe with some thread
>>> locking or a similar trick.
>>> But, in any case, I am just wondering whether other people have observed
>>> similar behavior and, if so, how they fixed it.
>>>
>>> Cheers,
>>> Jose
>> I think I forgot to include the error message :)
>>
>>      with self.schedd.transaction() as txn:
>> RuntimeError: Failed to connect to schedd.
>>
>> where self.schedd is an instance of htcondor.Schedd()
>> _______________________________________________
>> HTCondor-users mailing list
>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
>> subject: Unsubscribe
>> You can also unsubscribe by visiting
>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>>
>> The archives can be found at:
>> https://lists.cs.wisc.edu/archive/htcondor-users/
>>