Re: [HTCondor-users] Failed to connect to schedd



I tried this on bare metal, outside the Docker container, and got the
same errors. One thing I did not mention in my original post is that
this is a threaded app, and each

    with schedd.transaction() as txn:
        cluster_id = sub.queue(txn)

is in a different thread. I found this post in the mailing list archives:

https://www-auth.cs.wisc.edu/lists/htcondor-users/2016-October/msg00079.shtml

Are there any known issues with the thread safety of the Python
bindings when used this way?
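
One workaround I'm considering, in case the bindings turn out not to
be thread safe, is serializing the submits with a lock so only one
transaction is open against the schedd at a time. This is just a
sketch (submit_lock and submit_job are names I made up, not from the
actual app):

    import threading

    import htcondor

    schedd = htcondor.Schedd()
    submit_lock = threading.Lock()  # shared by all worker threads

    def submit_job(submit_dict):
        sub = htcondor.Submit(submit_dict)
        # Hold the lock for the whole transaction so that only one
        # connection to the schedd is open at any moment.
        with submit_lock:
            with schedd.transaction() as txn:
                return sub.queue(txn)

That gives up concurrency on the submit path, of course, but it would
at least tell me whether the overlapping transactions are the problem.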

On Tue, Jan 9, 2018 at 10:20 PM, Larry Martell <larry.martell@xxxxxxxxx> wrote:
> The Python script doing the job submits is running on a different
> physical machine than the schedd, and it's running inside a Docker
> container. I will try to test it outside the container and see if I
> get the same behaviour.
>
> No, with the current structure of the program it's not feasible to
> queue all the jobs in the same transaction object; I would have to
> refactor it a bit for that.
>
> On Tue, Jan 9, 2018 at 9:54 AM, Jason Patton <jpatton@xxxxxxxxxxx> wrote:
>> I can't reproduce this using a loop with 1000 jobs in a personal
>> condor. Are you also submitting to a remote schedd?
>>
>> If it makes sense to do so, do you get the same behavior if you put
>> "with schedd.transaction() as txn" outside the loop and queue all your
>> jobs with the same transaction object?
>>
>> The number of slots available and MAX_JOBS_RUNNING shouldn't matter.
>>
>> Jason
>>
>> On Sun, Jan 7, 2018 at 3:32 PM, Larry Martell <larry.martell@xxxxxxxxx> wrote:
>>> I am submitting jobs from Python in a loop that contains this:
>>>
>>>     sub = htcondor.Submit(submit_dict)
>>>     with schedd.transaction() as txn:
>>>         cluster_id = sub.queue(txn)
>>>
>>> I want to submit thousands of jobs, each one with a different
>>> submit_dict. What happens is the first 24 get submitted, then I start
>>> to get 'Failed to connect to schedd' from the call to
>>> schedd.transaction().
>>>
>>> I'll get that twice, then I can submit 12 jobs, then I get the error
>>> once, then I can submit 6 jobs. It continues like this, a few errors,
>>> a few successful submits.
>>>
>>> This is my MAX_JOBS_RUNNING setting on the master:
>>>
>>> condor_config_val MAX_JOBS_RUNNING
>>> MIN({23933, 10000})
>>>
>>> And this is the setting on both execute hosts:
>>>
>>> condor_config_val MAX_JOBS_RUNNING
>>> MIN({128651, 10000})
>>>
>>> condor_status shows 352 slots available.
>>>
>>> I don't see any errors in the submit log. Does anyone know how I
>>> can fix this or debug it further?
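
P.S. For completeness, here is roughly what Jason's suggestion above
(queueing all the jobs in one transaction) would look like if I do end
up refactoring. Sketch only; build_submit_dicts() stands in for
however the app actually builds the per-job dictionaries:

    import htcondor

    schedd = htcondor.Schedd()
    submit_dicts = build_submit_dicts()  # hypothetical helper

    # One transaction, and therefore one connection to the schedd,
    # for the whole batch instead of one connection per job.
    with schedd.transaction() as txn:
        for submit_dict in submit_dicts:
            sub = htcondor.Submit(submit_dict)
            cluster_id = sub.queue(txn)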