
Re: [HTCondor-users] Failed to connect to schedd



Thanks so much. This was indeed the issue. I refactored my code so
that all the condor jobs are submitted in one transaction and this
error was resolved.
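
For anyone who finds this thread later, here is a minimal sketch of the
"one transaction" pattern. It assumes the same Python bindings used
elsewhere in this thread (schedd.transaction() and Submit.queue(txn));
newer HTCondor releases may expose a different submit API. The names
submit_all and submit_factory are hypothetical helpers made up for this
illustration and are not part of htcondor.

```python
def submit_all(schedd, submit_dicts, submit_factory=None):
    """Queue one job per submit dict inside a single schedd transaction.

    schedd         -- an object with a .transaction() context manager,
                      e.g. htcondor.Schedd() in the bindings used in this thread
    submit_dicts   -- iterable of dicts, one per job
    submit_factory -- hypothetical hook for testing; defaults to htcondor.Submit
    """
    if submit_factory is None:
        # Imported lazily so the helper can be exercised without HTCondor installed.
        import htcondor
        submit_factory = htcondor.Submit

    cluster_ids = []
    # Open the transaction once, outside the loop, so the schedd is
    # contacted a single time rather than once per job.
    with schedd.transaction() as txn:
        for submit_dict in submit_dicts:
            sub = submit_factory(submit_dict)
            cluster_ids.append(sub.queue(txn))
    return cluster_ids
```

In a threaded app, the worker threads would only build the submit dicts
and hand them to one thread, which then calls something like
submit_all(htcondor.Schedd(), all_dicts).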

On Thu, Jan 11, 2018 at 5:24 PM, John M Knoeller <johnkn@xxxxxxxxxxx> wrote:
> The schedd is not multi-threaded.   Once you have begun a transaction to the Schedd, it will not do any other work until that transaction completes.   By any other work, I include accepting new connections (i.e. contacting the Schedd); also processing queries, or for that matter, starting and completing jobs.  Nothing will happen until the transaction completes.
>
> So the multi-threaded nature of your app almost certainly explains the failures to contact the Schedd.   If you synchronize your threads so that only one is attempting to talk to the schedd at a time, the failures should go away.
>
> Since your script did not corrupt the schedd, it can fairly be described as thread *safe*.   But it is certainly not multi-threaded.
>
> If it's not too much work, I would recommend that you redesign your script so it does all communication with any given schedd using only a single thread.  That includes queries as well as submits.
>
> You can safely have several threads each talking to different schedds at the same time though.
>
> -tj
>
> -----Original Message-----
> From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Larry Martell
> Sent: Thursday, January 11, 2018 1:31 PM
> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> Subject: Re: [HTCondor-users] Failed to connect to schedd
>
> I tried this on bare metal outside the docker container and I got the
> same errors. One thing I did not mention in my original post is that
> this is a threaded app, and each
>
>     with schedd.transaction() as txn:
>         id = sub.queue(txn)
>
> is in a different thread. I found this post in the ML archives:
>
> https://www-auth.cs.wisc.edu/lists/htcondor-users/2016-October/msg00079.shtml
>
> Are there any issues with the thread safety of this?
>
> On Tue, Jan 9, 2018 at 10:20 PM, Larry Martell <larry.martell@xxxxxxxxx> wrote:
>> The Python script doing the job submits is running on a different
>> physical machine than the Schedd and it's running inside a docker
>> container. I will try and see if I can test it outside the container
>> and see if I get the same behavior.
>>
>> No, with the current structure of the program it's not feasible to
>> queue all the jobs in the same transaction object - I would have to
>> refactor it a bit for that.
>>
>> On Tue, Jan 9, 2018 at 9:54 AM, Jason Patton <jpatton@xxxxxxxxxxx> wrote:
>>> I can't reproduce this using a loop with 1000 jobs in a personal
>>> condor. Is this also using a remote Schedd?
>>>
>>> If it makes sense to do so, do you get the same behavior if you put
>>> "with schedd.transaction() as txn" outside the loop and queue all your
>>> jobs with the same transaction object?
>>>
>>> The number of slots available and MAX_JOBS_RUNNING shouldn't matter.
>>>
>>> Jason
>>>
>>> On Sun, Jan 7, 2018 at 3:32 PM, Larry Martell <larry.martell@xxxxxxxxx> wrote:
>>>> I am submitting jobs from python in a loop that has this:
>>>>
>>>>     sub = htcondor.Submit(submit_dict)
>>>>     with schedd.transaction() as txn:
>>>>         id = sub.queue(txn)
>>>>
>>>> I want to submit thousands of jobs, each one with a different
>>>> submit_dict. What happens is the first 24 get submitted, then I start
>>>> to get 'Failed to connect to schedd' from the call to
>>>> schedd.transaction().
>>>>
>>>> I'll get that twice, then I can submit 12 jobs, then I get the error
>>>> once, then I can submit 6 jobs. It continues like this, a few errors,
>>>> a few successful submits.
>>>>
>>>> This is my MAX_JOBS_RUNNING setting on the master:
>>>>
>>>> condor_config_val MAX_JOBS_RUNNING
>>>> MIN({23933, 10000})
>>>>
>>>> And this is it on both execute hosts:
>>>>
>>>> condor_config_val MAX_JOBS_RUNNING
>>>> MIN({128651, 10000})
>>>>
>>>> condor_status shows 352 slots available.
>>>>
>>>> I don't see any errors in the submit log. Anyone know how I can fix
>>>> this and/or debug it further?