Re: [HTCondor-users] Failed to connect to schedd



The Python script that submits the jobs runs on a different physical
machine than the Schedd, and it runs inside a Docker container. I will
try running it outside the container to see whether I get the same
behaviour.

No, with the current structure of the program it's not feasible to
queue all the jobs in the same transaction object - I would have to
refactor it a bit for that.
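
For reference, the single-transaction shape suggested below would look
roughly like the following. This is only a sketch, not what the program
does today: queue_all, make_submit and submit_dicts are hypothetical
names used for illustration, and the schedd and Submit class are passed
in as arguments so the loop can be tried without a live pool.

```python
def queue_all(schedd, make_submit, submit_dicts):
    """Queue one job per submit dict inside a single schedd transaction.

    schedd       -- object with a transaction() context manager
                    (e.g. an htcondor.Schedd instance)
    make_submit  -- callable that builds a submit object from a dict
                    (e.g. htcondor.Submit)
    submit_dicts -- iterable of per-job submit dictionaries (hypothetical name)
    """
    cluster_ids = []
    # schedd.transaction() is called once for the whole batch,
    # instead of once per job as in the original loop.
    with schedd.transaction() as txn:
        for submit_dict in submit_dicts:
            sub = make_submit(submit_dict)
            cluster_ids.append(sub.queue(txn))
    return cluster_ids


# Intended use (needs the htcondor bindings and a reachable schedd):
#   import htcondor
#   ids = queue_all(schedd, htcondor.Submit, all_submit_dicts)
```

The catch is that every submit_dict has to be known before the
transaction is opened, which is exactly the refactoring mentioned above.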

On Tue, Jan 9, 2018 at 9:54 AM, Jason Patton <jpatton@xxxxxxxxxxx> wrote:
> I can't reproduce this using a loop with 1000 jobs in a personal
> condor. Is this also using a remote Schedd?
>
> If it makes sense to do so, do you get the same behavior if you put
> "with schedd.transaction() as txn" outside the loop and queue all your
> jobs with the same transaction object?
>
> The number of slots available and MAX_JOBS_RUNNING shouldn't matter.
>
> Jason
>
> On Sun, Jan 7, 2018 at 3:32 PM, Larry Martell <larry.martell@xxxxxxxxx> wrote:
>> I am submitting jobs from python in a loop that has this:
>>
>>     sub = htcondor.Submit(submit_dict)
>>     with schedd.transaction() as txn:
>>         id = sub.queue(txn)
>>
>> I want to submit thousands of jobs, each one with a different
>> submit_dict. What happens is the first 24 get submitted, then I start
>> to get 'Failed to connect to schedd' from the call to
>> schedd.transaction().
>>
>> I'll get that twice, then I can submit 12 jobs, then I get the error
>> once, then I can submit 6 jobs. It continues like this, a few errors,
>> a few successful submits.
>>
>> This is my MAX_JOBS_RUNNING setting on the master:
>>
>> condor_config_val MAX_JOBS_RUNNING
>> MIN({23933, 10000})
>>
>> And this is it on both execute hosts:
>>
>> condor_config_val MAX_JOBS_RUNNING
>> MIN({128651, 10000})
>>
>> condor_status shows 352 slots available.
>>
>> I don't see any errors in the submit log. Anyone know how I can fix
>> this and/or debug it further?