
Re: [HTCondor-users] Failed to connect to schedd



I can't reproduce this with a loop that submits 1000 jobs to a personal
HTCondor. Are you submitting to a remote schedd?

If it makes sense to do so, do you get the same behavior if you put
"with schedd.transaction() as txn" outside the loop and queue all your
jobs with the same transaction object?
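
To make that suggestion concrete, here is a minimal sketch that calls
schedd.transaction() once and queues every job inside it. The helper name
submit_all and its parameters are hypothetical; it assumes you pass in an
htcondor.Schedd() instance and the htcondor.Submit class, and that all of
your submit dictionaries are available before the transaction is opened.

```python
def submit_all(schedd, submit_factory, submit_dicts):
    """Queue every job inside a single schedd transaction.

    schedd         -- an htcondor.Schedd() instance (assumed)
    submit_factory -- htcondor.Submit, passed in so the sketch can be dry-run
    submit_dicts   -- iterable of per-job submit dictionaries
    """
    cluster_ids = []
    # schedd.transaction() is called once for the whole batch,
    # instead of once per job as in the original loop.
    with schedd.transaction() as txn:
        for submit_dict in submit_dicts:
            sub = submit_factory(submit_dict)
            # queue() returns the cluster id of the newly queued job.
            cluster_ids.append(sub.queue(txn))
    return cluster_ids

# Hypothetical usage against a real pool:
#   import htcondor
#   ids = submit_all(htcondor.Schedd(), htcondor.Submit, list_of_submit_dicts)
```

If the error goes away with this shape, that points at the repeated
schedd.transaction() calls rather than at the jobs themselves.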

The number of slots available and MAX_JOBS_RUNNING shouldn't matter:
both limit how many jobs run at once, not how many you can submit.

Jason

On Sun, Jan 7, 2018 at 3:32 PM, Larry Martell <larry.martell@xxxxxxxxx> wrote:
> I am submitting jobs from python in a loop that has this:
>
>     sub = htcondor.Submit(submit_dict)
>     with schedd.transaction() as txn:
>         id = sub.queue(txn)
>
> I want to submit thousands of jobs, each one with a different
> submit_dict. What happens is the first 24 get submitted, then I start
> to get 'Failed to connect to schedd' from the call to
> schedd.transaction().
>
> I'll get that twice, then I can submit 12 jobs, then I get the error
> once, then I can submit 6 jobs. It continues like this, a few errors,
> a few successful submits.
>
> This is my MAX_JOBS_RUNNING setting on the master:
>
> condor_config_val MAX_JOBS_RUNNING
> MIN({23933, 10000})
>
> And this is it on both execute hosts:
>
> condor_config_val MAX_JOBS_RUNNING
> MIN({128651, 10000})
>
> condor_status shows 352 slots available.
>
> I don't see any errors in the submit log. Anyone know how I can fix
> this and/or debug it further?